Abstract

The  USAF  generally  does  not  know  the  reliability  of  its  fielded  repairable 
systems.  The  reported  metric,  Mean  Time  Between  Failure  (MTBF),  is  too  lagging  to  be 
actionable  in  the  best  case,  and  is  not  representative  of  actual  system  reliability  in  the 
worst case. This thesis investigates the statistical techniques for measurement and
analysis of the reliability of fielded repairable systems, which differ markedly from those
for nonrepairable items. To frame the investigation, a comparison is made between the generally
accepted  definitions  and  metrics  and  those  used  across  the  US  Air  Force  (USAF). 
Reliability  can  be  analyzed  in  four  context  areas:  reliability  prediction  of  nonrepairable 
and  repairable  items  and  reliability  measurement  of  nonrepairable  and  repairable  items. 
This  research  is  focused  on  the  latter.  An  algorithmic  process  for  effective  measurement 
of  reliability  of  fielded  repairable  USAF  systems,  based  on  recurrent  event  analysis,  is 
proposed  and  demonstrated  using  a  non-parametric  approach  on  USAF  maintenance  data. 
The  approach  provides  a  new  capability  that  can  identify  even  short  term  changes  in 
system  Rate  of  Occurrence  of  Failure  (ROCOF),  which  can  identify  daily  or  hourly 
trends  across  the  fleet  subsystems.  This  new  approach  is  compared  to  USAF  calculations 
of  MTBF  over  the  same  period. 
EFFECTIVE MEASUREMENT OF RELIABILITY OF REPAIRABLE USAF SYSTEMS


I.  Introduction 


General  Issue 

The  United  States  relies  on  complex  systems  to  protect  and  project  the  national 
interest.  These  systems  must  be  available  to  meet  the  operational  need.  The  necessary 
system Operational Availability (Ao) is calculated from the overarching system
requirements. The system reliability, maintainability, and logistics support requirements
are subsequently derived from the Ao requirement.

The  U.S.  Department  of  Defense  has  renewed  emphasis  on  reliability  as  the  major 
contributor  to  system  availability  and  to  the  operations  and  support  costs  associated  with 
sustainment  of  the  systems.  In  a  recent  memo  the  Director  of  Operational  Test  and 
Evaluation,  Office  of  the  Secretary  of  Defense,  stated, 

“Poor  reliability  is  a  problem  with  major  implications  for  cost. 

Sustainment  costs  have  five  to  ten  times  more  impact  on  total  life  cycle 
costs  than  do  RDT&E  costs.  Unreliable  systems  have  higher  sustainment 
costs  because,  quite  plainly,  they  break  more  frequently  than  planned. 

Poor  reliability  leads  to  higher  sustainment  costs  for  replacement  spares, 
maintenance,  repair  parts,  facilities,  staff,  etc.  Poor  reliability  hinders 
warfighter  effectiveness  and  can  essentially  render  weapons  useless.”  [1] 

When systems do not meet the required availability because of lower than expected
reliability, the logistics system must increase the flow of parts. The Department of
Defense (DOD) supply chain spends billions of dollars to purchase, manage, store, track,
and  deliver  spare  parts  and  other  supplies  to  keep  military  equipment  ready  and 
operating.  DOD  reported  that  it  managed  more  than  4  million  secondary  inventory  items 
valued  at  more  than  $91  billion  as  of  September  2009.  Secondary  inventory  items 
include  reparable  components,  subsystems,  and  assemblies  other  than  major  end  items 
(e.g.,  ships,  aircraft,  and  helicopters),  consumable  repair  parts,  bulk  items  and  materiel, 
subsistence,  and  expendable  end  items  (e.g.,  clothing  and  other  personal  gear).  [2] 

Effective  Supply  Chain  Management  (SCM)  requires  active  control  of  system 
performance.  Performance-based  sustainment  makes  business  sense  when  operation  and 
support costs are significantly higher than acquisition costs and sustainment costs can be
reduced by smarter repairs. [3] Poor system performance (reliability) drives unnecessary
repair  actions  and  cost  at  the  weapon  system  and  commodity  level.  Repair  is  the  single 
biggest  customer  of  (buying  components  and  subassemblies),  and  supplier  to  (selling 
repaired  commodities),  the  USAF  supply  chain.  The  current  USAF  repair  network 
includes  over  150  managers,  nearly  50,000  maintainers,  and  a  $14  billion  budget.  [4] 

To enforce the emphasis on system availability, Chairman of the Joint Chiefs of
Staff Instruction 3170.01F mandates use of Availability Key Performance Parameters
(KPP) and Reliability and Ownership Cost Key System Attributes (KSA). [5] The Under
Secretary of Defense (USD) for Acquisition, Technology, and Logistics (AT&L) issued a
memorandum that defines the metrics and reporting requirements [6]. For the DoD, the
reportable metric quantifying materiel reliability is Mean Time Between Failure (MTBF),
further defined as Operating Hours/Failures.
Reliability Definition

A  widely  accepted  definition  of  reliability  is  the  probability  that  a  system  or 
product  will  perform  in  a  satisfactory  manner  for  a  given  period  of  time  when  used  under 
specified  operating  conditions  in  a  given  environment.  [7] 

Reliability  will  be  segmented  into  four  areas  as  presented  in  Table  1.  Primarily, 
this  thesis  focuses  on  the  Measurement  of  Recurrence  Data  (bottom  right  quadrant  of 
Table  1). 

Table 1. The Four Context Areas of Reliability Analysis

                    | Prediction                             | Measurement
                    | (Estimation from Probabilistic Models) | (Data from Deployed Systems)
--------------------+----------------------------------------+-----------------------------------
Life Data           | Traditional focus of reliability       | Data fit to known distributions
(throw away items,  | Based on design, part selection, and   | for comparison to prediction
nonrepairable)      | production quality                     |
                    | Mean Time To Failure (MTTF)            |
--------------------+----------------------------------------+-----------------------------------
Recurrence Data     | Reliability Block Diagrams             | Arrival Interval Analysis
(repairable items,  | Stochastic Point Process Models (HPP,  | Recurrent Event Data Analysis
systems)            | NHPP, and many variations)             | (nonparametric)
                    |                                        | Critical data is ordered sequence
                    |                                        | of times to failures.

Reliability  can  be  predicted  and  measured.  A  reliability  prediction  is  a  probability 
calculation  based  on  characteristics  of  the  design.  The  reliability  calculation  is  intended 
to  affect  the  design  to  meet  an  availability  requirement.  A  reliability  measurement  tracks 
failures as a function of time or usage. To compare the measured reliability to the
predicted reliability, the definitions of the system or product, failure modes, period of time,
operating conditions, and environment must be common or accounted for. The technique
for  comparison  of  measured  reliability  to  predicted  reliability  attempts  to  model  the 
failure  occurrences  as  a  parametric  function  related  to  the  predictive  model. 
Problem Statement


The DoD and USAF measure of reliability is mean time between failure (MTBF),
which is a discrete value calculated as the ratio of operational time to failures [8]. This
definition of MTBF is an oversimplification that makes assumptions about the failure
distribution that may not be accurate or intended. The assumptions necessary to state
MTBF as the ratio of time to failures are not supported in the preponderance of
applications [9] [10] [11]. To make credible judgments about the failure distribution, the
operating time to failure and environment must be tracked for individual items. The AF
does not have a process or system to effectively or accurately track the performance of
individual items or materiel. The AF is not applying effective processes or expertise to
analyze the available field reliability data.

Effective  measurement  of  reliability  requires  accurate  time  to  failure,  sequence  of 
time  to  failures,  and  failure  mode  data.  AF  policy  requires  monitoring  of  component 
configuration  and  in-system  performance  [12]  and  detailed  component  histories  but  a 
general  AF  data  system  does  not  exist  that  effectively  collects,  retains,  or  provides  access 
and  analysis  of  weapon  system  component  performance  data  and  histories. 
Research Objectives and Focus

The following objectives will guide this thesis. First, this thesis seeks to
summarize the research into the general (non-DoD) basis for the MTBF calculation, its
use as a specification or measurement of reliability of fielded repairable
systems/components, and alternative methods for measurement of reliability in this
context (row 2 column 2 of Table 1). Second, the thesis seeks to examine DoD and USAF
measurement of the reliability of fielded repairable systems and components, using data
to demonstrate the expected shortfalls. Lastly, a more effectively derived measure is sought.

The  focus  of  this  research  will  be  on  the  definition  and  nonparametric 
measurement  of  reliability  of  repairable  USAF  systems.  As  the  field  of  reliability  is 
expansive,  this  paper  will  not  deal  in  detail  with  reliability  prediction  methods  or  with 
parametric  statistical  modeling  of  failure  data. 

The primary research question is, "Based on USAF repairable system recurrence
data, how can reliability best be non-parametrically measured?"
Methodology Overview

1. A review of literature will examine the generally accepted definition of
reliability  and  compare  to  the  DoD  reliability  definition.  The  review  will  briefly 
examine  the  current  state  of  reliability  prediction  and  measurement  in  the  four 
context  areas  shown  in  Table  1.  The  applicability  of  MTBF  and  Mean  Time  To 
Failure  (MTTF)  as  the  metric  to  define  reliability  of  repairable  and  nonrepairable 
items  will  be  examined. 

2.  The  DoD  and  Air  Force  definition  of  MTBF  will  be  considered.  The  AF  use  of 
MTBF  as  ‘the’  indicator  of  reliability  of  systems  and  commodities  will  be 
considered  in  the  context  of  the  current  literature. 

3.  The  accuracy  and  applicability  of  REMIS  data  as  the  authoritative  source  for  AF 
reliability  and  maintainability  data  will  be  examined  by  using  some  specific  data 
extraction  and  analysis  cases.  Some  methods  of  REMIS  data  analysis  that 
support  reliability  improvement  will  be  presented. 

Implications 

Reliability  of  fielded  repairable  and  complex  systems  in  use  in  the  DoD  is 
generally not known. In most cases the reliability metric being reported (MTBF) is not
accurate and in the worst case may not even be correlated to the actual system
performance.
Preview


The DoD definition of reliability and the requirement for measurement of
reliability are not coherent. MTBF is designated as the measure of reliability [13]. This
paper will examine a more concise definition of reliability and the need for a less
prescriptive requirement for the reliability metric. The current DoD emphasis is on
reliability during design and test. Contracts require a comprehensive reliability program
with defined metrics. Failure Reporting, Analysis, and Corrective Action Systems (FRACAS) are required.
These specific requirements do not flow into, and are not generally measurable in, the
Operations and Support phase of the programs.

The  inadequacy  of  the  AF  Reliability  reporting  will  be  examined  using  some 
anecdotal  cases.  The  process  of  the  anecdotal  cases  will  be  related  to  the  general  case  of 
REMIS  inadequacy  as  a  source  for  reporting  component  reliability  or  for  root  cause 
analysis  of  system  reliability  issues. 

A  suggestion  for  analysis  of  existing  USAF  maintenance  data  that  would  provide 
a  more  effective  view  of  repairable  system  reliability  and  may  lead  to  AF  system 
reliability  improvement  will  be  presented. 
II. Literature Review


Chapter  Overview 

The  purpose  of  this  chapter  is  to  provide  a  reliability  definition  and  review  of  the 
basis  and  effectiveness  of  Mean  Time  Between  Failure  (MTBF)  as  the  measure  of 
reliability. Brief background information on the context areas shown in Table 1
is presented. A more detailed review of literature pertaining to the process for
nonparametric analysis of recurrence data of deployed systems will be provided. This
will be the context used for the methodology and data analysis chapters. The intent is to
frame in the reader's mind that different data sets and different statistical processes are
required  in  each  context. 

Reliability  Definitions 

Reliability  of  systems  started  to  receive  serious  consideration  with  the  increasing 
complexity  of  weapon  systems  during  World  War  II.  A  widely  accepted  definition  of 
reliability  is  traced  back  to  the  Advisory  Group  on  the  Reliability  of  Electronic 
Equipment (AGREE) formed by the U.S. Department of Defense in 1952. A 1957 AGREE
report  defined  reliability  as  the  probability  that  a  system  or  product  will  perform  in  a 
satisfactory  manner  for  a  given  period  of  time  when  used  under  specified  operating 
conditions  in  a  given  environment  [14].  Note  that  this  definition  has  four  important 
elements:  (1)  reliability  as  a  probability  distribution,  (2)  defined  satisfactory  performance, 
(3)  specific  operating  conditions,  and  (4)  specific  environment.  All  of  these  elements  are 
critical  to  an  unambiguous  definition  of  reliability  of  a  system  or  individual 
component.  [7] 
The definition of reliability in the DoD Guide for Achieving Reliability,
Availability, and Maintainability and in MIL-STD-721 (cancelled 1995), Definition of
Terms for Reliability and Maintainability, includes the four important elements of the
definition above: (1) “the probability of” (2) “an item to perform a required function” (3)
“under stated conditions” (4) “for a specified period of time.” [15] [16]

The  USAF  definition  of  reliability  in  Air  Force  Instruction  21-118,  Improving  Air 
and  Space  Equipment  Reliability  and  Maintainability,  omits  the  probability  element  and 
introduces  the  generalization  of  reliability  as  MTBF,  “The  ability  of  a  system  or 
component to perform its required functions under stated conditions for a specified period
of  time.  Usually  expressed  as  mean  time  between  failures  (MTBF).”  [17] 

The  Under  Secretary  of  Defense  (USD)  for  Acquisition,  Technology,  and 
Logistics  (AT&L)  issued  a  memorandum  defining  reliability  metrics  and  reporting 
requirements [6]. That memorandum defines Materiel Reliability as:

Materiel Reliability is a measure of the probability that the system will
perform without failure over a specific interval. Reliability must be
sufficient to support the warfighting capability needed. Materiel
Reliability is generally expressed in terms of a mean time between
failure(s) (MTBF) and, once operational, can be measured by dividing
actual operating hours by the number of failures experienced during a
specific interval.

The  USD  for  AT&L  definition  is  problematic.  It  states  that  Materiel  Reliability  is 
a  specific  probability  value,  of  zero  failures,  over  a  specific  interval.  It  does  not  mention 
specification/control  of  the  operational  conditions  or  environment. 
The USD for AT&L memo and the USAF definition say that Materiel Reliability
is generally expressed in terms of MTBF, and the memo describes operational Materiel Reliability as
actual operating hours divided by failures in a defined interval. While the total life
operating hours divided by total life failures is the literal value of the MTBF of a system,
in practice the calculation is typically applied to a windowed period of the lifecycle
(as suggested in the USD memo), where the calculation may
not be applicable. In a windowed period of time, operating hours divided by failures
is a valid mean only for the exponential probability distribution, where the failure
rate is constant. That case is not applicable to repairable systems, as will be discussed later
in this chapter and demonstrated in Chapter 3.
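
The window dependence described above can be illustrated with a minimal sketch. The failure times below are hypothetical numbers chosen for illustration (not USAF data), and the function name windowed_mtbf is an assumption introduced here. The sketch applies the operating hours divided by failures calculation to one repairable system whose failures arrive faster as it ages.

```python
def windowed_mtbf(failure_times, start, end):
    """Operating hours in [start, end) divided by failures observed in that window."""
    failures = sum(1 for t in failure_times if start <= t < end)
    if failures == 0:
        return None  # the ratio is undefined when the window holds no failures
    return (end - start) / failures


# Hypothetical cumulative operating hours at each failure of ONE repairable
# system (illustrative values only; the gaps between failures shrink with age).
failure_times = [400, 750, 1050, 1300, 1500, 1650, 1770, 1860, 1930, 1980]

lifetime_mtbf = windowed_mtbf(failure_times, 0, 2000)   # whole observed life
early_mtbf = windowed_mtbf(failure_times, 0, 1000)      # first 1000 hours
late_mtbf = windowed_mtbf(failure_times, 1000, 2000)    # second 1000 hours

print(lifetime_mtbf, early_mtbf, late_mtbf)
```

With these assumed data the same system reports three different values depending on the window chosen, so a single windowed ratio says little about the failure behavior unless the rate of failure is constant.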

The first and third sentences of the USD memo are not complementary.
Materiel Reliability cannot be both “... the probability that the system will perform
without failure over a specific interval” and “expressed in terms of a mean time between
failure(s) (MTBF) and, once operational, can be measured by dividing actual operating
hours by the number of failures experienced during a specific interval.” MTBF is a single
number derived from the total lifecycle. It gives no information about the probability of
failure in any specific interval unless the specific distribution is known.

It  is  important  that  the  DoD/USAF  definition  of  reliability  be  applicable  and 
consistent  across  all  four  areas  of  reliability  shown  in  Table  1.  The  original  1957  AGREE 
definition  is  consistent  and  applicable  across  the  reliability  field. 
Reliability in Context

The  practice  of  reliability  prediction  and  measurement  involves  statistical 
modeling  and  analysis  of  data  on  time  to  occurrence  of  events  of  interest.  When  assessing 
reliability it is important to make the distinction between nonrepairable components and
repairable systems, that is, between life data and recurrence data [18], as represented by the rows of
Table  1  (reproduced  here). 

Table 1. The Four Context Areas of Reliability Analysis

                    | Prediction                             | Measurement
                    | (Estimation from Probabilistic Models) | (Data from Deployed Systems)
--------------------+----------------------------------------+-----------------------------------
Life Data           | Traditional focus of reliability       | Data fit to known distributions
(throw away items,  | Based on design, part selection, and   | for comparison to prediction
nonrepairable)      | production quality                     |
                    | Mean Time To Failure (MTTF)            |
--------------------+----------------------------------------+-----------------------------------
Recurrence Data     | Reliability Block Diagrams             | Arrival Interval Analysis
(repairable items,  | Stochastic Point Process Models (HPP,  | Recurrent Event Data Analysis
systems)            | NHPP, and many variations)             | (nonparametric)
                    |                                        | Critical data is ordered sequence
                    |                                        | of times to failures.

That  distinction  is  often  omitted  as  the  terms  and  concepts  are  similar  and  the 
distinctions  are  subtle  [19].  According  to  Meeker  and  Escobar  [20]  the  important 
distinction  is  between  data  from,  and  models  for,  the  following: 

•  The  time  of  failure  for  nonrepairable  units. 

•  The  sequence  of  system  failure  times  for  repairable  systems. 

In  a  1970  IEEE  Transactions  on  Reliability  editorial  Mr.  Ralph  Evans  stated, 
“After  many  years  the  reliability  profession  is  still  in  sad  shape  with  regard  to 
understanding  its  basic  concepts.”  He  closed  the  article  with,  “One  cannot  guarantee  that 
wrong answers will be obtained if the proper model is not analyzed, but one can
guarantee that the existing literature is very confusing and very few reliability engineers
really  understand  the  various  kinds  of  models  they  invoke  from  time  to  time.”  [21]  The 
editorial  was  republished  in  the  June  2000  IEEE  Transactions  on  Reliability  to,  “show 
that  many  things  do  not  change,  especially  where  people  and  their  beliefs  and  problems 
are  concerned.”  [22] 

John Usher pointed out in 1993 that even though most complex systems are
repaired, not replaced, the statistical methods and models that are appropriate only for
nonrepairable systems are often used for reliability analysis [23]. He presents failure data
from a repairable system and shows how application of the incorrect analysis (based on
an iid assumption) provides a result that is opposite the correct result. The 1984 Ascher
and Feingold book, Repairable Systems Reliability: Modeling, Inference, Misconceptions
and Their Causes, details many of the misconceptions and problems associated with
treating repairable systems reliability data as if it were from a nonrepairable system [24].
Ascher and Christian Hansen presented a course, Concepts and Models for Repairable
Systems Reliability, at the Centro de Investigación en Matemáticas (CIMAT) in 2009. The
abstract for their course says they present the basic concepts and models for parts
(nonrepairable) and systems (repairable) and, “stresses their up to infinite differences,
rather than their superficially striking but relatively unimportant similarities.” [25]

It  is  established  that  the  appropriate  statistical  processes  and  data  sets  for 
reliability  analysis  depend  on  whether  the  subject  is  nonrepairable  (life  data)  or  repairable 
(recurrence  data).  These  two  contexts  are  further  divided  by  the  purpose  of  the  analysis 
and  the  phase  of  the  product  lifecycle. 
Parametric, or probabilistic, models are used to predict future performance and
compare alternative designs in the absence of complete data. Nonparametric analysis of
operational data is used to evaluate current and past reliability performance. The
prediction of future performance based on a design process versus the performance
measurement of a finite population of fielded units may be similar to the Deming
categories of analytic versus enumerative studies. [26] Deming’s analytic study category
is statistical analysis of the processes that generate units over time. His enumerative
category of statistical analysis uses data from identifiable units to make inferences about
the larger population.

Life Data - Nonrepairable Components (Row 1 of Table 1)

Life data is associated with nonrepairable products or systems: a single time to
event, usually the end of life, for each of a population of like units (same design, material,
manufacturing processes) [27]. Most reliability literature has been devoted to the
modeling  and  analysis  of  life  data.  Statistical  software  packages  that  facilitate  analysis  of 
life  data  are  available  and  are  widely  used  for  reliability  analytics. 

For  nonrepairable  items  the  lifetime  is  a  random  variable.  The  failure  of  one  item 
does  not  affect  the  performance  of  another  item  in  the  same  population  so  the  assumption 
that  the  lifetimes  are  independent  is  reasonable.  If  the  population  is  produced  to  the  same 
design,  using  the  same  processes  and  materials,  it  is  also  reasonable  to  assume  the  item 
lifetimes  have  the  same  distribution.  These  two  assumptions  lead  to  the  basic  assumption 
that  the  lifetimes  are  independent  and  identically  distributed  (iid).  [19] 

To  assess  the  reliability  of  nonrepairable  items  the  failures  are  tracked  as  a 
function of usage, usually hours. To make predictions about failures the data is then fitted
to a Lifetime Distribution Model. Mathematical models of components and entire
systems  may  be  produced  by  combining  models  of  many  failure  modes.  The  combination 
may  be  done  by  Monte  Carlo  simulation  or  by  analytical  methods.  System  models  are 
useful  for  predicting  spare  parts  usage,  availability,  maintainability,  and  support  costs. 
[28] 

The  following  definitions  are  from  the  Rigdon  and  Basu  textbook,  Statistical 
Methods for the Reliability of Repairable Systems [19]. Under the iid assumption the
lifetimes have a corresponding cumulative distribution function (cdf) F(t), which is the
probability that the lifetime T of an individual component is no greater than t or,
equivalently, the fraction of the total population that will fail by time t.

$$F(t) = P(T \le t)$$

Equation 1. Life Data, cumulative distribution function

The Reliability Function R(t), sometimes called the survival function, is the
probability that an individual component will survive beyond t. Survival and failure are
mutually exclusive, so R(t) = 1 - F(t).

The lifetime distribution model is a probability density function (pdf) f(t).

$$f(t) = \frac{d}{dt}F(t) = -\frac{d}{dt}R(t)$$

Equation 2. Life Data, probability density function

The hazard function is related to, but distinct from, the pdf.

$$h(t) = f(t \mid T > t) = \frac{f(t)}{R(t)}$$

Equation 3. Life Data, Hazard Function
The hazard function is the limit, as the interval length goes to zero, of the probability
that a unit fails in a small interval given that it survived to the beginning of the interval,
divided by the length of the interval. If the hazard function is
increasing, the probability of failure is increasing with the
age of the system: the nonrepairable system is wearing out. A nonrepairable system with
a decreasing hazard function is experiencing burn-in [19].

The  pdf  and  the  hazard  function  are  important  elements  to  define  the  reliability  of 
a  nonrepairable  item.  They  define  the  expected  life  and  the  probability  of  failure  in  an 
interval. 
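
The relationships among F(t), R(t), f(t), and h(t) can be checked numerically with a short sketch. The Weibull form and the parameter values used here are assumptions chosen for illustration only; a shape parameter of 1 reduces the Weibull to the exponential distribution. The function names are introduced for this example and do not come from the cited text.

```python
import math


def weibull_cdf(t, shape, scale):
    """F(t) = P(T <= t) for an assumed Weibull lifetime model."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def weibull_reliability(t, shape, scale):
    """R(t) = 1 - F(t), the probability of surviving beyond t."""
    return math.exp(-((t / scale) ** shape))


def weibull_pdf(t, shape, scale):
    """f(t) = dF/dt, valid for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """h(t) = f(t) / R(t)."""
    return weibull_pdf(t, shape, scale) / weibull_reliability(t, shape, scale)


scale = 500.0  # hypothetical characteristic life in hours
for shape, label in [(0.5, "decreasing hazard (burn-in)"),
                     (1.0, "constant hazard (exponential)"),
                     (2.0, "increasing hazard (wear-out)")]:
    young = weibull_hazard(100.0, shape, scale)
    old = weibull_hazard(900.0, shape, scale)
    print(f"shape={shape}: h(100)={young:.5f}, h(900)={old:.5f} -> {label}")
```

Under these assumed parameters, only the shape of 1 gives the same hazard at 100 and 900 hours, which is the constant hazard case.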

Recurrence  Data  -  Repairable  Systems/Components  (Row  2  of  Table  1) 

Recurrence  data  consist  of  times  for  any  number  of  repeated  events  on  a 
population  unit,  for  example,  repairs  of  a  product.  For  a  repairable  system,  a  number  of 
failures  are  expected  for  a  single  system  [19].  Many  systems  and  repairable  components 
accumulate  repeated  repairs  over  time.  In  comparison  to  life  data,  analysis  of  recurrence 
data  is  underdeveloped. 

A commonly used definition of a repairable system [24] is a system which, after
failing to perform one or more of its functions satisfactorily, can be restored to
satisfactory performance by an action other than replacement of the entire system. Data
from repairable systems are usually given as ordered failure times $T_1, T_2, \ldots$ with
data coming from a single system or from several systems of the same kind. [29]

Analysis  of  such  recurrence  data  requires  special  statistical  models  and  methods 
not  generally  covered  in  basic  reliability  books  [27]  [19].  Ascher  and  Feingold  [24]  wrote 
what  may  be  the  first  book  devoted  to  repairable  system  reliability  in  1984.  Their  book 
presents  the  case  that  researchers  and  practitioners  do  not  recognize  or  accommodate  the 
crucial differences between the statistical treatments of repairable (recurrence data) and
nonrepairable (life data) systems. They used examples to show that conclusions
from data may be very wrong if times between failures are treated as statistically
independent and identically distributed (iid) random variables when the assumption is not
valid.
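
A small sketch shows why the order of the times between failures matters. The numbers are hypothetical and are not taken from the cited examples. The same six gaps are arranged in two orders: an analysis that treats them as iid sees identical data in both cases, while the ordered sequences describe opposite trends.

```python
from itertools import accumulate

# Hypothetical hours between successive failures of one repairable system.
gaps = [10, 20, 40, 80, 160, 320]


def failures_by(times_between, t):
    """Count failures at or before cumulative operating time t."""
    return sum(1 for arrival in accumulate(times_between) if arrival <= t)


improving = gaps                       # gaps grow: failures become rarer
deteriorating = list(reversed(gaps))   # gaps shrink: failures become more frequent

mean_gap = sum(gaps) / len(gaps)       # identical for both orderings
half_life = sum(gaps) / 2              # midpoint of the observed operating time

print("mean time between failures:", mean_gap)
print("improving system, failures in first half:", failures_by(improving, half_life))
print("deteriorating system, failures in first half:", failures_by(deteriorating, half_life))
```

Both orderings have the same mean and the same set of values, so any method that ignores sequence cannot tell the improving system from the deteriorating one.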

Stochastic  point  processes  are  used  to  assess  the  reliability  of  repairable  systems. 
Failures  are  tracked  as  occurrences  of  events,  or  points,  in  time.  The  order  and  duration 
between  points  is  critical. 

The  following  definitions  of  functions  for  recurrence  data  are  from  the  Rigdon 
and  Basu  textbook,  Statistical  Methods  for  the  Reliability  of  Repairable  Systems  [19]. 
Assume that a random variable N(t) represents the number of failures in the interval [0, t].
To specify a stochastic model for a point process there must be a joint distribution of the
random variables $N(t_1), N(t_2), \ldots, N(t_n)$ for any n and any $t_1, t_2, \ldots, t_n$.

The Mean Cumulative Function (MCF) of a point process is defined to be the
expected value of N(t). This function is the pointwise average of all population curves
passing through each t [27]. The MCF is often denoted by $\Lambda(t)$. Methods for estimation
of the MCF are discussed later in this section.

$$\Lambda(t) = E[N(t)]$$

Equation 4. Recurrence Data, Mean Cumulative Function
When the MCF is differentiable, the derivative is defined as the Rate of
Occurrence of Failures (ROCOF). The ROCOF is the instantaneous rate of change in the
expected number of failures. Methods for estimation of the ROCOF are discussed later in
this section.

$$\mu(t) = \frac{d}{dt}\Lambda(t)$$

Equation 5. Recurrence Data, Rate of Occurrence of Failure

The  MCF  and  the  ROCOF  are  important  elements  to  define  the  reliability  of  a 
repairable  system.  They  define  the  expected  number  of  failures  at  time  t  and  the 
probability  of  failure  in  an  interval. 
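
As a minimal sketch of how these two functions might be estimated from field data, the code below uses a common nonparametric MCF estimate: at each failure time the estimate increases by one divided by the number of systems still under observation. The fleet histories are hypothetical, the function names are introduced for this example, and the ROCOF is approximated as the average slope of the estimated MCF over an interval. This is an illustration under those assumptions, not the procedure developed later in this thesis.

```python
def mcf_estimate(histories):
    """Nonparametric MCF estimate.

    histories: list of (failure_times, observation_end) tuples, one per system.
    Returns a list of (time, estimate) pairs, one per observed failure.
    """
    events = sorted(t for times, _ in histories for t in times)
    points, total = [], 0.0
    for t in events:
        # Systems whose observation period still covers time t.
        at_risk = sum(1 for _, end in histories if end >= t)
        total += 1.0 / at_risk
        points.append((t, total))
    return points


def mcf_at(points, t):
    """Value of the step-function MCF estimate at time t."""
    value = 0.0
    for time, estimate in points:
        if time <= t:
            value = estimate
    return value


def average_rocof(points, a, b):
    """Average slope of the estimated MCF over the interval (a, b]."""
    return (mcf_at(points, b) - mcf_at(points, a)) / (b - a)


# Hypothetical fleet of three systems: failure ages in hours and end of observation.
fleet = [([100, 300, 450], 500), ([200, 400], 500), ([150], 250)]

mcf = mcf_estimate(fleet)
print(mcf)
print(average_rocof(mcf, 0, 250), average_rocof(mcf, 250, 500))
```

In this assumed fleet the estimated rate is higher in the second half of the observation period than in the first, a change that a single MTBF value over the whole period would not show.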

Reliability  Prediction  (Column  1  of  Table  1) 

Reliability  Prediction  refers  to  the  use  of  probabilistic  models,  typically 
parametric,  for  the  prediction  of  the  reliability  performance  of  nonrepairable  items  and 
repairable  systems. 

Reliability  Prediction  in  the  Context  of  Life  Data  (Column  1  Row  1  of  Table  1) 

The  theoretical  models  used  to  describe  unit  lifetimes  are  Lifetime  Distribution 
Models.  The  population  is  generally  all  unit  lifetimes  for  all  of  the  units  manufactured 
based  on  a  particular  design,  material,  and  manufacturing  process  [30].  A  random  sample 
of  size  n  from  this  population  is  the  collection  of  failure  times  observed  for  a  randomly 
selected  group  of  n  units. 

A lifetime distribution model can be any probability density function (pdf) $f(t)$ defined over the range of time from $t = 0$ to $t = \infty$. The corresponding cumulative distribution function (cdf) $F(t)$ gives the probability that a randomly selected unit will fail by time $t$.

The pdf $f(t)$ has only non-negative values and eventually either becomes 0 as $t$ increases, or decreases towards 0. The cdf $F(t)$ is monotonically nondecreasing and goes from 0 to 1 as $t$ approaches infinity. In other words, the total area under the pdf curve is always 1. This means that any single randomly chosen unit, and therefore the entire population, will eventually fail given unlimited time.
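The pdf and cdf properties described above can be checked numerically. The sketch below uses an exponential lifetime model with an assumed rate `RATE` (an illustrative value, not from the source) and a simple trapezoidal integration written for this example.

```python
import math

RATE = 0.01  # assumed failure rate (per hour), illustrative only

def pdf(t):
    """Exponential lifetime density f(t)."""
    return RATE * math.exp(-RATE * t)

def cdf(t):
    """Exponential cumulative distribution F(t): probability of failure by t."""
    return 1.0 - math.exp(-RATE * t)

def area_under_pdf(upper, steps=100_000):
    """Trapezoidal integral of f(t) from 0 to `upper`."""
    h = upper / steps
    total = 0.5 * (pdf(0.0) + pdf(upper))
    total += sum(pdf(i * h) for i in range(1, steps))
    return total * h

print(area_under_pdf(2000.0))  # approaches 1 as the upper limit grows
print(cdf(2000.0))             # same quantity from the closed-form cdf
```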

The distributions most commonly used to model life data are the exponential, Weibull, and gamma [19]. I will not discuss the Weibull or gamma distributions in this paper, but characteristics of the exponential distribution have a direct relationship to the later discussion of Mean Time Between Failure (MTBF).

Two theorems, 4 and 5, from the Rigdon and Basu text [19] state the two unique characteristics of the exponential distribution. The exponential distribution has the memoryless property and it is the only continuous distribution with the memoryless property. The exponential distribution has a constant hazard function and is the only distribution with a constant hazard function. The memoryless property means that the probability of failure is not dependent on age. The probability of an old unit surviving the next interval is equal to the probability that a new unit will survive an interval of the same length. As shown in the text [19], the result of a constant hazard function is that the mean and the hazard are reciprocals of each other.
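A short numerical check of the memoryless property and the reciprocal relationship is sketched below. The hazard value `RATE` and the ages and interval used are assumptions chosen only for illustration.

```python
import math

RATE = 0.005  # assumed constant hazard (failures per hour), illustrative only

def survival(t):
    """R(t) = P(T > t) for an exponential lifetime."""
    return math.exp(-RATE * t)

def conditional_survival(age, interval):
    """P(T > age + interval | T > age): survival of a unit already `age` old."""
    return survival(age + interval) / survival(age)

new_unit = conditional_survival(0.0, 100.0)
old_unit = conditional_survival(1000.0, 100.0)
print(new_unit, old_unit)   # equal up to rounding: age does not matter
print(1.0 / RATE)           # mean life is the reciprocal of the hazard
```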

Reliability Prediction in the Context of Recurrence Data (Column 1 Row 2 of Table 1)

The difference between the statistical processes applicable to Life Data and Recurrence Data derives from the observation of a single failure per system, usually the end of life, for nonrepairable systems and of multiple failures per system for repairable systems. Because a repairable system fails multiple times, the iid assumption for the times between failures is usually not valid [19].

In some limited cases the times between failures may be iid and exponentially distributed, so a Homogeneous Poisson Process (HPP) could model the recurrence data with a constant ROCOF. While a bathtub-shaped hazard function for life data may look identical to a bathtub-shaped ROCOF for recurrence data, the interpretations are different. The bathtub hazard function is an expression of the conditional probability of the only failure of the system. The bathtub ROCOF shows that a system will have many failures early in its life, followed by a period of constant ROCOF, and then the ROCOF will increase as the system ages and failures become more frequent [19].

In a paper presented at the IEEE 2005 Reliability and Maintainability Symposium, Mettas and Zhao of the ReliaSoft Corporation said two models commonly used for analysis of repairable systems data are the perfect renewal process (PRP) and the nonhomogeneous Poisson process (NHPP) [31]. The PRP corresponds to an assumption of perfect repairs where the system is as-good-as-new after repair. The NHPP corresponds to minimal repair where the system is as-good-as-old after repair. The NHPP assumption is that the system after repair is in no better condition than immediately before the failure. Most repairs do not fit either of the extremes of the PRP or NHPP but fall somewhere in between. A general renewal process (GRP) model attempts to analyze complex repairable systems with varying degrees of repair. The Mettas and Zhao paper provides an overview of existing repairable system models and proposes a formulation for estimating parameters of the GRP and for development of confidence bounds.
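The two extremes of repair can be contrasted with a small simulation. The sketch below is not the Mettas and Zhao formulation; it assumes a Weibull wear-out model with hypothetical parameters `SHAPE`, `SCALE`, and `HORIZON`, draws fresh Weibull times between failures for the perfect-repair case, and uses a power-law NHPP for the minimal-repair case.

```python
import random

SHAPE, SCALE = 2.0, 100.0   # assumed Weibull shape and scale (hours)
HORIZON = 500.0             # assumed observation period (hours)

def simulate_prp(rng):
    """Perfect repair: each time between failures is a fresh Weibull draw."""
    times, t = [], 0.0
    while True:
        t += rng.weibullvariate(SCALE, SHAPE)   # (scale, shape) argument order
        if t > HORIZON:
            return times
        times.append(t)

def simulate_nhpp(rng):
    """Minimal repair: power-law NHPP with MCF (t / SCALE) ** SHAPE.

    Arrival times of a unit-rate Poisson process are mapped through the
    inverse of the MCF, a standard way to generate NHPP event times.
    """
    times, s = [], 0.0
    while True:
        s += rng.expovariate(1.0)               # next unit-rate arrival
        t = SCALE * s ** (1.0 / SHAPE)          # inverse of the MCF
        if t > HORIZON:
            return times
        times.append(t)

rng = random.Random(1)
prp_counts = [len(simulate_prp(rng)) for _ in range(2000)]
nhpp_counts = [len(simulate_nhpp(rng)) for _ in range(2000)]
print(sum(prp_counts) / 2000, sum(nhpp_counts) / 2000)
```

Under these assumed wear-out parameters the minimally repaired system accumulates far more failures over the same period, because each repair leaves it as old as it was before the failure.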

There are many other models and variations of models presented in academic literature. There are eight different models discussed in the Mettas and Zhao paper. These models attempt to accommodate such variations as preventive maintenance effects, incorporate the results of simulations such as Monte Carlo, compensate for small sample sizes or few failures, and so on. However, as Guo, Ascher and Love [32] point out, too much attention is paid to the invention of new models with little thought as to their applicability. Too little attention is paid to necessary data collection and consideration of the usefulness of the models for solving real reliability problems. These models are difficult to apply to engineering problems either because of the strong assumptions or the model complexity.

Reliability  Measurement  (Column  2  of  Table  1) 

Reliability  Measurement  refers  to  the  analysis  of  field  data  to  monitor  and  assess 
the  reliability  performance  of  systems  in  operational  use. 

The failure rate of a nonrepairable component applies only to the first failure times of the population of parts [30]. The population of nonrepairable parts decreases over the lifetime as individual parts fail, until all have failed. A nonrepairable population is one for which individual items that fail are removed permanently from the population. While the system may be repaired by replacing failed units from either a similar or a different population, the members of the original population dwindle over time until all have eventually failed.

A repairable system can be returned to operational condition by adjustment or replacement of parts. The rate at which failures occur during system usage (and are then repaired) is defined as the Rate of Occurrence of Failure (ROCOF) or "repair rate". It is incorrect to talk about failure rates or hazard rates for repairable systems. These terms apply only to the first failure times (life data) for a population of nonrepairable components [30].

Reliability Measurement in the Context of Life Data (Column 2 Row 1 of Table 1)

The  purpose  of  measuring  the  reliability  of  nonrepairable  items  is  to  assess  the 
quality  of  the  product  and  for  comparison  to  the  expected  life.  Life  data  is  often  used  to 
support  predictive  analysis  by  fitting  measured  data  to  statistical  models. 

There are life data models that are applicable to very precise Failure Reporting, Analysis, and Corrective Action System (FRACAS) data and models that are intended to compensate for less precise data. The Fifth Edition of Dr. Abernethy's New Weibull Handbook presents a comprehensive treatment of the two most widely used life data analysis models, the Weibull and the Crow-AMSAA, as well as a good overview of most models currently used for life data analysis.

Life data analysis is often oversimplified by applying the most basic distributions without verifying, or even stating, the underlying assumptions. Reliability analysis of life data requires careful planning and execution. Mistakes can be costly in terms of time and money, and wrong decisions can be detrimental to system operation [20]. Abernethy warns that it is tempting to plot a single Weibull for systems from poorly defined populations and multiple failure modes. This misapplication will show a $\beta$ close to one, roughly equivalent to using mean-time-to-failure (MTTF) and exponential reliability, and will mask infant mortality and wear-out modes. The results are not meaningful for individual failure modes. This method was common and is still used by those unaware of the advantages of newer methods for system models [28].

Meeker suggests a general strategy for analysis of life data [20]:


1. Begin with graphical analysis without making any distributional or model assumptions.

2. Fit one or more parametric models depending on the purpose of the study and the amount/source of data.

3. Assess the adequacy of the model.

4.  If  there  are  no  “obvious  departures  from  the  assumed  model,  one  will 
generally  proceed,  with  caution,”  to  predict  future  outcomes  with 
statistical  intervals  showing  uncertainty  and  variability. 

5.  Display  results  graphically  including  estimates,  predictions,  and 
uncertainty  bounds. 

6.  Assess  the  adequacy  of  the  model  assumptions  and  provide  the 
conclusions  with  the  reliability  results. 

Reliability Measurement in the Context of Recurrence Data (Column 2 Row 2 of Table 1)

Recurrence events are analyzed over a period for a single repairable system or for multiple similar systems. Early repairable system data analysis techniques, 1952 to 1991, focused on times to first occurrence, times to second occurrence, and so on, or on times between occurrences [24] [33]. Later methods use parametric counting process models and analysis for the number of occurrences [19] [20]. Estimation of the model parameters requires iterative procedures or special software. There is a significant body of literature on the subject, but parametric methods are computationally intensive and not intuitive. Necessary assumptions are rarely stated, investigated, or justified [9]. Nelson provides a less complex method for nonparametric analysis of recurrent data that is also applicable to cost and other “observed values” of events. The process is nonparametric in that it does not specify a point process model for the recurrence rate. The Nelson process is based on the Mean Cumulative Function (MCF) [27]. Trindade and Nathan present a tutorial on a practical methodology for applying MCF analysis, based on their work at Sun Microsystems [34].

To apply nonparametric analysis to recurrent event data, each unit of the population is described by a cumulative history function for the number of event recurrences over time. Figure 1 depicts a single unit's cumulative history function:


Figure 1 Cumulative Number of Failures of a Single System [35]

The nonparametric model for the population is the cumulative history functions of all units of the population. At a time $t$, the units have a distribution of their cumulative number of events. This distribution differs at different times $t$ and has a mean $M(t)$ called the MCF. The MCF is the pointwise average of all cumulative history functions as shown in Figure 2.


Figure 2 MCF and Population Distribution at Time t [35] (horizontal axis: Age)

When the data is uncensored (all units in the population are still operating at the point in time) the MCF values at different recurrence times are estimated by calculating the average of the cumulative number of recurrences of events for each unit in the population at that point in time. Suppose that the cumulative value for a sample unit $i$ by time $t$ is $Y_i(t)$, $i = 1, 2, \ldots, N$. Then the estimate of the MCF at time $t$ is simply the average of the cumulative values at age $t$ [35].

$M(t) = [Y_1(t) + Y_2(t) + \cdots + Y_N(t)]/N$

Equation  6  Estimate  of  MCF  or  M(t) 
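Equation 6 translates directly into code. The sketch below is a minimal illustration for uncensored data; the failure ages in `histories` are hypothetical, and the function names are chosen for this example only.

```python
def cumulative_count(failure_ages, t):
    """Y_i(t): number of recorded failures of one unit at or before age t."""
    return sum(1 for age in failure_ages if age <= t)

def mcf_uncensored(histories, t):
    """Equation 6: average of the cumulative counts of all N units at age t."""
    return sum(cumulative_count(h, t) for h in histories) / len(histories)

# Hypothetical failure ages (hours) for three units, all observed to 100 hours.
histories = [[12.0, 40.0, 95.0], [30.0], [55.0, 80.0]]
for t in (25.0, 50.0, 100.0):
    print(t, mcf_uncensored(histories, t))
```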

When the data is censored (some units in the population stopped operating prior to $t_i$) the censoring times must be considered as explained by Nelson [27].
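One common way to account for censoring, in the spirit of the nonparametric estimate Nelson describes, is to let each failure add the reciprocal of the number of units still under observation at that age. The sketch below handles right censoring only; the data and the name `mcf_censored` are hypothetical, and it is not a substitute for the full procedure in [27].

```python
def mcf_censored(units):
    """Return [(age, MCF estimate)] at each failure age.

    `units` is a list of (failure_ages, end_of_observation_age) pairs. Each
    failure adds 1 / (number of units still under observation at that age).
    """
    events = sorted(age for failures, _ in units for age in failures)
    estimate, curve = 0.0, []
    for age in events:
        at_risk = sum(1 for _, end in units if end >= age)
        estimate += 1.0 / at_risk
        curve.append((age, estimate))
    return curve

# Hypothetical data: the third unit leaves observation at 60 hours.
units = [([20.0, 70.0], 100.0), ([50.0], 100.0), ([10.0], 60.0)]
for age, value in mcf_censored(units):
    print(age, round(value, 3))
```

In this hypothetical data set the failure at 70 hours is averaged over two units rather than three, because only two units are still being observed at that age.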

The Rate of Occurrence of Failure (ROCOF) can be estimated from the estimate of the MCF by calculating the slope of the MCF at $t$.
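The slope calculation can be sketched as a finite difference between successive MCF points. The ages and MCF values below are hypothetical, and `rocof_from_mcf` is a name introduced only for this example.

```python
def rocof_from_mcf(ages, mcf_values):
    """Slope of the MCF between successive points (failures per hour per system)."""
    slopes = []
    for k in range(1, len(ages)):
        rise = mcf_values[k] - mcf_values[k - 1]
        run = ages[k] - ages[k - 1]
        slopes.append((ages[k], rise / run))
    return slopes

# Hypothetical MCF estimates at 10-hour steps.
ages = [0.0, 10.0, 20.0, 30.0, 40.0]
mcf_values = [0.0, 0.2, 0.4, 0.4, 1.0]
for age, slope in rocof_from_mcf(ages, mcf_values):
    print(age, round(slope, 3))
```

A flat segment of the MCF gives a ROCOF of zero, and a steep segment flags a period in which failures are arriving quickly.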

The  Trindade  and  Nathan  process  is  explained  and  adapted  to  the  purpose  of  this 
paper  in  Section  3.  It  will  be  applied  to  a  set  of  real  USAF  historical  maintenance  data  to 
demonstrate  the  utility  and  applicability. 

Applicability  of  MTBF 

Mean  Time  Between  Failure  (MTBF)  is  only  applicable  to  repairable 
systems/components.  It  is  often  incorrectly  used  interchangeably  with  Mean  Time  To 
Failure  (MTTF)  of  nonrepairable  items. 

MTBF  is  a  reliability  “buzz  word”.  Numbers  are  used  without  an  understanding 
of  what  they  truly  represent.  Basic  and  necessary  assumptions  are  not  stated.  While 
MTBF  may  be  an  indication  of  reliability,  it  does  not  necessarily  represent  the  expected 
service  life  of  a  nonrepairable  product  or  the  expected  failure  free  period  of  a  repairable 
system.  Ultimately  an  MTBF  value  is  meaningless  if ‘failure’  is  undefined  and 
assumptions  made  in  the  calculation  are  not  stated  or  are  unrealistic. 




The  only  completely  accurate  way  to  calculate  MTBF  for  a  nonrepairable  product 
(actually  MTTF  for  nonrepairable  product)  is  to  wait  until  every  unit  in  the  population 
has  failed,  or  for  a  repairable  system  wait  until  the  system  is  retired,  and  then  do  the 
calculations.  This  is  obviously  impractical  so  MTBF  is  generally  estimated.  Assumptions 
are  required  to  estimate  MTBF.  This  can  lead  to  numbers  that  don’t  have  a  value  in 
themselves  but  have  some  value  in  a  relative  sense.  That  is,  the  reliability  of  two  products 
or  systems  can  be  compared  IF  calculated  in  EXACTLY  the  same  way  and  IF  ALL  the 
same  assumptions  are  made  and  validated. 

A  common  misconception  about  MTBF  is  that  it  is  the  expected  period  between 
system  failures.  It  is  not  uncommon  to  see  MTBF  numbers  on  the  order  of  a  million 
hours.  It  is  unrealistic  to  believe  that  a  system  could  operate  continuously  for  more  than 
100  years  without  failure. 

MTBF  does  not  mean  the  expected  failure  free  period,  the  useful  life,  or  the 
average  life.  So  what  does  it  mean?  As  with  many  questions,  the  answer  depends  on  the 
context. 

Definition:  The  expectation  of  the  operating  time  between  failures.  [36]  The 
general  expression  for  MTBF  is  given  by: 

$E\{t\} = \mathrm{MTBF} = \int_0^\infty x f(x)\,dx = \int_0^\infty R(x)\,dx$,
where $R(t)$ denotes the reliability (performance).

Equation  7  Mean  Time  Between  Failure  (MTBF) 
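The equality of the two integrals in Equation 7 follows from integration by parts, using $f(x) = -dR(x)/dx$ and assuming that $x R(x) \to 0$ as $x \to \infty$, which holds whenever the mean is finite. The second line works the exponential special case, where the constant rate $\lambda$ is a symbol introduced here for illustration.

```latex
\begin{align*}
\int_0^\infty x f(x)\,dx
  &= \int_0^\infty x \left(-\frac{dR(x)}{dx}\right) dx
   = \Bigl[-x R(x)\Bigr]_0^\infty + \int_0^\infty R(x)\,dx
   = \int_0^\infty R(x)\,dx, \\
R(x) = e^{-\lambda x} \;&\Rightarrow\;
  \mathrm{MTBF} = \int_0^\infty e^{-\lambda x}\,dx = \frac{1}{\lambda}.
\end{align*}
```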

When the system Rate of Occurrence of Failure (ROCOF) is constant with iid failure times, the operating times between failure recurrences can be represented by the Homogeneous Poisson Process (HPP) model [20]. Remember that the HPP of a repairable system or systems is often confused with the exponentially distributed failures of nonrepairable systems. If that mistake were made in calculating the MTBF of an HPP failure recurrence distribution for a repairable system, the units would be correct, giving the appearance of a correct result. But the underlying data would not be correctly applied, the number would not be accurate, and the conclusion would be wrong. This will be demonstrated in Chapter III.

The HPP assumption (actually the exponential distribution) is widely used, although inappropriately, in the development of preventive maintenance strategies for repairable systems. In many cases, the MTBF is used to determine a preventive maintenance interval for a component. However, the use of the MTBF metric implies that the data were analyzed with an HPP, since the mean fully describes the recurrence rate only when the HPP is used for analysis. The use of the HPP, in turn, implies that the component has a constant ROCOF. This raises the question of why anyone would preventively replace a component that has a constant ROCOF and does not experience wear-out over time. With a constant ROCOF assumption, preventive maintenance actions do not improve the reliability of the component, but rather waste time and parts.

Once an MTBF is calculated based on the HPP assumption, what is the probability that any one particular repairable system will be operational at time equal to the MTBF? We have the following equation:

$R(t) = e^{-t/\mathrm{MTBF}}$

But when $t = \mathrm{MTBF}$

$R(t) = e^{-1} = 0.3679$

This tells us that the probability that any one particular system will operate without failure to its calculated MTBF is only 36.8%.
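The same calculation can be repeated for other multiples of the MTBF. In the sketch below the MTBF value is hypothetical; under the constant-rate assumption only the ratio of operating time to MTBF affects the result.

```python
import math

def survival_probability(t, mtbf):
    """R(t) = exp(-t / MTBF) under the constant-rate (HPP/exponential) assumption."""
    return math.exp(-t / mtbf)

MTBF = 500.0  # hypothetical value in hours; only the ratio t / MTBF matters
for multiple in (0.1, 0.5, 1.0, 2.0):
    print(multiple, round(survival_probability(multiple * MTBF, MTBF), 4))
```

Even at one tenth of the MTBF the failure-free probability is noticeably below one, and at twice the MTBF it is small.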

Inadequacy  of  MTBF  as  a  Measure  of  Reliability  for  Repairable  Systems 

MTBF is the most often cited measure of reliability of repairable systems. MTBF is literally the total operating time divided by the total number of failures. The common concept of MTBF assumes a one pass lifecycle, or a perfect repair process, where all failures come from a single population distribution. This assumption is predicated on an assumption of a Homogeneous Poisson Process (HPP). For the assumption to be valid, the times between failures must be statistically independent and identically distributed (iid). Under this assumption the mean completely characterizes the distribution and the ROCOF is constant. The validity of the assumption is rarely checked or stated [34].

Trindade and Nathan say the popularity of the MTBF metric is due to its simplicity and its ability to cater to the one number syndrome [9]. The exponential failure distribution assumption makes analysis very simple but it does not apply to most real systems. If this model were applicable to automobiles, their reliability would not depend on mileage. If a product wears out or becomes less reliable it obviously does not have a constant failure rate. Using the MTBF when it is not appropriate can lead to missed failure trends and wrong conclusions about the reliability of the systems [9].

USAF Calculation of MTBF as a Measure of Reliability

The U.S. Air Force defines MTBF in Technical Order 00-2-2, Maintenance Documentation, as Mean Time Between Failure (Inherent). Inherent refers to a Type 1 failure, or actual failure of the item.

$\text{MTBF-1 (INHERENT)} = \dfrac{\text{FLYING HOURS} \times \text{QPA} \times \text{UF}}{\text{INHERENT FAILURES}}$

NOTE: The Usage Factor (UF) is the ratio of end items with the WUC configuration to the end items accruing Flying Hours. The Quantity Per Assembly (QPA) is the number of the WUC items installed per end item.

Equation 8 USAF Definition of MTBF
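Equation 8 can be sketched as a small function. The function name, the choice to return `None` when there are no failures, and the input values are all assumptions made for illustration; they are not taken from the Technical Order or from USAF data.

```python
def mtbf_inherent(flying_hours, qpa, usage_factor, inherent_failures):
    """Equation 8: flying hours * QPA * UF / inherent (Type 1) failures.

    Returns None when there are no failures, since the ratio is undefined.
    """
    if inherent_failures == 0:
        return None
    return flying_hours * qpa * usage_factor / inherent_failures

# Hypothetical quarter: 12,000 fleet flying hours, 2 items per aircraft,
# every aircraft carrying the configuration, 16 inherent failures.
print(mtbf_inherent(12_000.0, 2, 1.0, 16))
print(mtbf_inherent(12_000.0, 2, 1.0, 0))
```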

MTBF, as defined by the AF, is a single number calculated from the fleet totals of flying hours and failures. That number does not give any insight into the characteristics of the distribution beyond the arithmetic mean.

Using fleet total flying hours for the calculation causes significant loss of accuracy (a wider confidence interval) as the sample size decreases. Small sample sizes are common to small fleets, few failures, or short sample periods.

The  practical  usage  of  USAF  MTBF  is  for  windowed  periods  of  the  total 
lifecycle.  Often  metrics  are  reported  in  quarterly  or  annual  intervals. 

This application is a lagging indicator. No information is available until the end of the period, so there is significant latency in data availability. If the period is shortened to reduce the latency, the number of events in the period decreases. At some point the MTBF is undefined (zero events in the period).

The magnitude of the MTBF is dependent on the choice of period interval and location. The calculation for the selected interval is inaccurate due to left and right data censoring. Left and right data censoring occurs where units of the population operated for a significant time without failure before (left censoring) and after (right censoring) the data interval. As the data analysis period is reduced relative to the expected failure free period of the systems, the censoring error becomes large.
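The dependence on window length and location can be illustrated with a small sketch. The pooled failure times, the fleet size, and the assumption that every system operates continuously are all hypothetical; `windowed_mtbf` is a name introduced for this example.

```python
def windowed_mtbf(failure_hours, fleet_size, start, end):
    """Fleet operating hours in [start, end) divided by failures in that window."""
    failures = sum(1 for h in failure_hours if start <= h < end)
    operating_hours = fleet_size * (end - start)
    return operating_hours / failures if failures else None

# Hypothetical pooled failure times (operating hours) for a fleet of 5 systems,
# each assumed to operate continuously from 0 to 100 hours.
failure_hours = [10, 20, 20, 30, 40, 40, 50, 70, 90, 100]
for start, end in [(0, 50), (50, 100), (20, 40), (80, 90)]:
    print((start, end), windowed_mtbf(failure_hours, 5, start, end))
```

With these assumed data the two 50-hour windows give different values for the same fleet, and the 10-hour window from 80 to 90 hours contains no failures, so no MTBF can be computed for it.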

The USAF uses maintenance data to document the system failures. There is no method within that data system to define specific failure modes, correlate repair actions across levels of maintenance, or track the numbers of systems in use at any given time. Operating time is not accurately tracked below the end item level. This creates a conglomeration of failures (hardware, software, test deficiency, and so on) tracked in a single distribution. It provides no ability to drill down below the end item level for root cause analysis.

Clearly  there  is  a  need  for  better  USAF  reliability  metrics  that  account  for  trends 
and  allow  for  drill  down  to  root  cause. 

Summary 

According to John Usher [23], even though most complex systems are repaired, not replaced, when they fail, many reliability practitioners use statistical methods and models that are only appropriate for nonrepairable systems. Ascher and Feingold [24] discuss serious issues with treating repairable systems reliability data as if it were from a nonrepairable system.

The USAF not only uses the wrong metric to assess the reliability of fielded repairable systems, but also calculates that metric incorrectly. There is an applicable process available, which can be used with existing USAF data, to provide a more effective measure of repairable system reliability.

III. Methodology


Chapter  Overview 

To demonstrate that the calculation of MTBF is not an effective measure of the reliability of fielded repairable USAF systems, an illustrative set of maintenance data will be used to represent the failure history of similar systems. This set of data will be used to show that the MTBF calculation is completely dependent on the time interval and location of the data sample and is not necessarily a characteristic of the system’s reliability.

The graphical and nonparametric analysis process presented by Dr. Nelson [35], further refined and used by Sun Microsystems Inc. for reliability analysis of their products [9], will be applied to historical USAF maintenance data. The output of that analysis process will be compared to the MTBF calculations (Equation 8) from the same data, in the same period, to show the relative merits. The example is similar to one used in Dr. Nelson’s book, Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications [27].

The Adapted Recurrent Events Data Analysis Process


The  process  below  is  a  generalized  adaptation  of  the  more  statistically  rigorous 
process  presented  by  Dr.  Nelson  in  his  text  [27].  This  process  more  closely  follows  the 
algorithmic  analysis  process  presented  by  Trindade  and  Nathan  of  Sun  Microsystems  [9]. 


1. Obtain a data set of ordered event recurrence intervals from each system. The events are a function of usage such as operating time, cycles, or calendar days.

2. Plot the recurrence events on a cumulative timeline. This graphical analysis technique will show obvious trends or outliers in the data.

3. Plot the cumulative event functions for each system of the population or of a statistically valid random sample of systems.

4. Plot the Mean Cumulative Function (MCF) and confidence bounds for the data set.

5. Plot the ROCOF of the systems. The recurrence rate is the derivative of the MCF at a point in time. Because there is no closed form solution to the derivative of the MCF, the recurrence rate will be approximated by calculating the slope of the MCF at the point in time.

The analysis process can be completed for relatively small data sets using a spreadsheet program such as MS Excel. MS Excel was used for the notional example in this chapter. For larger data sets a more specialized program is useful. To process the very large USAF maintenance data sets, MS Excel was used to combine, format, and correlate the necessary data. JMP was used to calculate the MCF and confidence interval. The JMP data was exported back into Excel for plotting. JMP does provide output plots but they are image files with few formatting options for reporting.

Recurrent Events Data Analysis Methodology Example

Dr. Nelson uses a five population staircase history function to introduce and explain the concepts of a nonparametric population model (cumulative history functions of all units) and Mean Cumulative Function (MCF) [27]. I will use the same technique and similar data for five identical systems, each operating for 100 hours, to illustrate the adapted recurrent events analysis process. The USAF Mean Time Between Failure (MTBF) (Equation 8) is calculated for two intervals, 20 and 50 hours, and overlaid on the cumulative failure functions.


1.  Obtain  a  set  of  ordered  event  recurrence  intervals  from  each  system  (or  from 
each  similar  system). 

The cumulative failure events of a notional sample of five similar systems, each operating for 100 hours, are shown in Table 2.

Table 2 Cumulative Failures as a Function of Time

2. Plot the recurrence events on a cumulative timeline.


The data from Table 2 was used in JMP to create the event plot in Figure 3. This plot may provide a quick graphical illustration of the systems’ reliability. The small data set does not readily support an iid assumption. While the HPP may be a reasonable model for each system, it appears that the systems are not identical (different MTBF), so an HPP may not represent the stochastic point process for the failures of these multiple systems. The assumption of a constant ROCOF and the use of a windowed calculation of MTBF (period less than total failure/usage) would not be appropriate for this data set. But fitting to a parametric model is not necessary for cause analysis.

The plot of failure recurrences as a function of cumulative time provides a quick top level indicator of issues that impact system availability that may not be hardware reliability issues. From Figure 3 it can be seen that System E operated without failure to 100 hours. System B had two sets of clustered failures. As the system failure events are only reported every 10 hours, the second failures of the System B clusters may have occurred a very short time after repair.

Figure 3 JMP Event Plot

3.  Plot  the  cumulative  event  functions  for  each  system  of  the  population  or  of  a 
statistically  valid  random  sample  of  systems. 

A cumulative plot is a simple graph that can be constructed from a set of events-of-interest for a repairable system. This plot can be constructed for all failures, outages, system failures due to specific failure modes, etc. A cumulative plot can be constructed for just one system, for a statistical sample, or for all systems in a population. The cumulative plot in Figure 4 is a plot of the number of failures on each system versus the operating hours of the system. The cumulative plot reveals the sequence of events with operating time. For example, System A had one failure at 10 hours and was failure free to 90 hours. System E operated for 100 hours before failure.


Figure 4 Plot of Cumulative Failures of Five Similar Systems (vertical axis: Cumulative Failures; horizontal axis: Operating Hours, 0 to 100; series: System A through System E)

4. Plot the Mean Cumulative Function (MCF) and confidence bounds for the data set.

The  MCF  is  a  useful  construct  to  plot  the  average  behavior  of  large  populations  of 
repairable  systems/items.  The  MCF  is  constructed  incrementally  at  each  recurrent  event 
by  calculating  the  mean  quantity  of  recurrent  events  of  the  population  of  systems  at  risk 
at  that  point  in  time.  The  number  of  systems  at  risk  is  the  number  of  systems  that  are 
operating  and  providing  information.  [9] 

Information can be obscured by data censoring and truncation. One could also have interval or window censoring, which is dealt with extensively in [9]. The MCF accounts for systemic gaps in data by appropriately normalizing by the number of systems at risk.

The  example  in  Figure  5  shows  the  MCF  for  the  five  similar  systems.  The 
calculation  at  each  point  of  interest  would  be  in  the  manner  of  these  samples: 

$\mathrm{MCF}(10) = \dfrac{4(0) + 1(1)}{5} = 0.20$

$\mathrm{MCF}(50) = \dfrac{1(0) + 3(1) + 1(2)}{5} = 1.00$

$\mathrm{MCF}(100) = \dfrac{0(0) + 1(1) + 2(2) + 1(3) + 1(4)}{5} = 2.40$
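The sample calculations can be reproduced in a few lines of Python. The dictionary below encodes, for each time, how many of the five systems hold each cumulative failure count, as read from the weights in the calculations; the names `counts_at` and `mcf_point` are introduced only for this sketch.

```python
# Number of systems holding each cumulative failure count, read from the
# sample calculations: {failure count: number of systems}.
counts_at = {
    10: {0: 4, 1: 1},
    50: {0: 1, 1: 3, 2: 1},
    100: {0: 0, 1: 1, 2: 2, 3: 1, 4: 1},
}

def mcf_point(distribution):
    """Mean cumulative failures: sum(count * systems) / total systems."""
    systems = sum(distribution.values())
    return sum(count * n for count, n in distribution.items()) / systems

for hours, distribution in counts_at.items():
    print(hours, mcf_point(distribution))
```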

For this trivial example the 95% confidence interval was calculated using the built-in confidence function in MS Excel. The more sophisticated recurrence analysis algorithm in JMP fits a distribution to the recurrences of the population of systems at the point in time and determines the confidence interval from that distribution. Dr. Nelson provides procedures for calculating pointwise confidence bounds [27].

A quick look at this MCF plot with confidence intervals shows the systems that are out of the ‘normal’ range. While these are not necessarily outliers in the strict statistical sense, this is a strong visual indication of problem systems/items. [9] From Figure 5 it can be seen that System A fails early, outside of the confidence interval, then recovers. System B is always at or near the top of the confidence interval and is outside the interval for 40 of the 100 hours and again at the end of the observation period. System B would be an excellent candidate for specific root cause analysis. Notice that System E is below the confidence bounds for the entire period. It would be good to examine that system to see why it is so reliable.
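
The screening described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the thesis procedure: the bound mimics what Excel's confidence function returns (a normal quantile times the standard deviation divided by the square root of the sample size), the sample standard deviation is assumed because the text does not say which was used, and all failure times other than System A at 10 hours and System E at 100 hours are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical failure times in operating hours (see assumptions above).
FAILURES = {
    "A": [10, 90],
    "B": [20, 40, 70, 100],
    "C": [30, 60, 95],
    "D": [45, 85],
    "E": [100],
}


def counts_at(failures, t):
    """Cumulative number of failures on each system at operating time t."""
    return {name: sum(1 for x in times if x <= t)
            for name, times in failures.items()}


def mcf_bounds(counts, level=0.95):
    """Return (lower, mean, upper) using a normal-approximation half width."""
    values = list(counts.values())
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(values) / sqrt(len(values))
    centre = mean(values)
    return centre - half_width, centre, centre + half_width


def flag_systems(failures, t, level=0.95):
    """Label each system as above, below, or inside the bounds at time t."""
    counts = counts_at(failures, t)
    low, _, high = mcf_bounds(counts, level)
    return {name: "above" if n > high else "below" if n < low else "inside"
            for name, n in counts.items()}


print(flag_systems(FAILURES, 50))
```

With these assumed data, System B is flagged above the bounds and System E below them at 50 hours, which matches the qualitative reading of Figure 5.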


[Figure 5: legend lists Systems A through E, the MCF, and the upper and lower confidence interval bounds]

Figure 5 Plot of the MCF for Five Similar Systems with 95% Confidence Interval

5.  Plot  the  ROCOF  of  the  systems. 

The ROCOF is approximated by calculating the slope of the MCF at each point in time. It is expressed in events per unit of time per population unit. The ROCOF plot in Figure 6 is simply the slope of the MCF points as calculated by MS Excel and is a poor approximation due to the small sample size. It demonstrates the concept to be applied in section IV.

Figure 6 ROCOF for Five Similar Systems

ROCOF  Compared  to  USAF  Calculation  of  MTBF  as  a  Measure  of  Reliability 

The USAF calculation of Mean Time Between Failure (MTBF), using Equation 8, is performed for two intervals, 20 hours and 50 hours. These values are overlaid on the cumulative failure functions in Figure 7 to demonstrate the process that will be used with actual USAF data in section IV.

For comparison of the USAF calculation of MTBF to the ROCOF, the inverse of MTBF will be used. As discussed in section II, the USAF incorrectly uses MTBF as the measure of reliability (often cited as a ‘failure rate’) in a period. The incorrect usage is related to the assumption of the special case of the HPP, where the failure rate is constant and is the inverse of the MTBF. This leads to the use of time/failures as the failure rate, but even in the special case of the HPP the strict definition of failure rate is the inverse, in units of failures/time, the same units as the ROCOF.

If the MTBF is used as the measure of reliability for this population of systems, it can be seen that there is no information available until the end of the period. This creates a significant lag in data availability. If an attempt is made to shorten the period for additional resolution, the number of events in the period decreases. At some point the MTBF is undefined (zero events in the period). It can also be seen that the magnitude of the MTBF is dependent on the choice of period interval and location. The calculation for the selected interval is inaccurate due to left and right data censoring, which occurs when units of the population operated without failure for a significant time before (left censoring) and after (right censoring) the data interval.

Incorrect  analysis  of  the  reliability  of  this  population  of  systems  on  the  basis  of 
the  20  hour  MTBF  would  say  that  the  ‘failure  rate’  of  the  population  of  systems  initially 
is  0.01  failures  per  hour.  After  20  hours  of  usage  the  failure  rate  would  increase  to  0.015 
failures  per  hour.  The  lowest  failure  rate  would  be  at  60  hours  and  the  highest  at  100. 

Analysis  of  the  reliability  of  this  population  of  systems  on  the  basis  of  the  50  hour 
MTBF  would  say  that  the  ‘failure  rate’  of  the  population  of  systems  initially  is  0.02 
failures  per  hour  and  increases  slightly  to  0.028  at  100  hours. 
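
The dependence of the MTBF-derived ‘failure rate’ on the chosen interval can be illustrated with a small Python sketch. It assumes that the USAF calculation amounts to total operating hours in the period divided by the number of failures in the period; Equation 8 is not reproduced here, so that form is an assumption. The failure times are hypothetical apart from System A at 10 hours and System E at 100 hours, and they were chosen to reproduce the 50 hour values quoted above, not the 20 hour values.

```python
# Hypothetical failure times in operating hours (see assumptions above).
FAILURES = {
    "A": [10, 90],
    "B": [20, 40, 70, 100],
    "C": [30, 60, 95],
    "D": [45, 85],
    "E": [100],
}
N_SYSTEMS = 5


def windowed_mtbf(failures, n_systems, width, horizon):
    """Return (window_end, mtbf, failure_rate) for consecutive windows."""
    all_times = [x for times in failures.values() for x in times]
    results = []
    start = 0
    while start < horizon:
        end = start + width
        # Count failures in the half-open window (start, end].
        n_fail = sum(1 for x in all_times if start < x <= end)
        # Every system is assumed to operate for the whole window.
        hours = n_systems * width
        mtbf = hours / n_fail if n_fail else None  # undefined with no failures
        results.append((end, mtbf, n_fail / hours))
        start = end
    return results


for end, mtbf, rate in windowed_mtbf(FAILURES, N_SYSTEMS, 50, 100):
    print(end, mtbf, round(rate, 3))
```

Shortening the window to 10 hours with the same assumed data leaves at least one window with no failures, where the MTBF is undefined, which is the resolution limit described above.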
The ROCOF by comparison more closely follows the time of the actual system failures and provides better resolution and accuracy.

Figure 7. USAF Comparison of MTBF Derived ‘Failure Rate’ and ROCOF Over Two Different Periods (20 and 50 Hours)


Summary 

The  notional  data  presented  in  this  chapter  illustrates  the  concepts  that  will  be 
applied  to  the  real  USAF  maintenance  data  in  the  next  chapter.  The  adapted  recurrent 
events  data  analysis  process  will  show  that  near-real-time  information  can  be  obtained 
about  the  reliability  performance  of  a  population  of  fielded  repairable  systems.  The 
process  also  allows  analysis  of  the  reliability  performance  of  items,  or  changing  subsets, 
within  the  population  relative  to  others  in  the  population  and  to  their  own  history. 
IV. Analysis and Results


Chapter  Overview 

The data required for recurrent event analysis of USAF systems are generally collected, but the necessary data elements are not linked together. Usage hours are accurately collected at the end item level, so subsystem failure recurrence times are available. But usage time is not generally tracked on subassemblies or components. Some subassembly/component data can be correlated by associating removal/install times with the usage of the end item, but there is no standard serialization schema and no error checking, so data accuracy is very poor using that method.

The  demonstration  presented  in  this  chapter  is  analysis  of  two  years  of 
organizational  (flightline)  maintenance  data  for  three  subsystems  of  a  weapon  system  in 
four  basic  configurations.  The  four  weapon  system  basic  configurations  will  be  called  A, 
B,  C,  and  D.  The  subsystems  that  are  mostly  common  across  each  configuration  will  be 
identified  by  their  three  digit  Work  Unit  Code  (WUC)  14A,  74A,  and  74C.  There  are 
some  limitations  to  using  two  years  of  data  from  a  weapon  system  that  was  fielded  more 
than  15  years  ago  and  continues  in-service  after  the  two  year  period.  Those  limitations 
will  be  discussed  in  context. 

The data was exported from LIMS-EV, then combined and formatted in MS Excel. Recurrence calculations were done using the commercially available statistical analysis software JMP. JMP has Reliability and Survival functions that include recurrence data analysis. JMP produces the Event Plot (Figure 3 and Figure 8) and calculates the MCF with Upper and Lower 95% Confidence Limits (UCL and LCL).

The subsystem recurrence data is available via LIMS-EV but is not easily accessible, nor is it exportable in a directly usable format. More than 600 worksheets were created to assemble and format the necessary data for the demonstration (24 months, four weapon system configurations, three subsystems) in this chapter. The volume of data and the large amount of manual data manipulation make the probability of error near one. In spite of the likelihood of error, the analysis does provide business intelligence and actionable evidence of issues affecting weapon system availability.

1.  Obtain  a  set  of  ordered  event  recurrence  intervals  from  each  system  (or  from 
each  similar  system). 

Reports were exported from LIMS-EV to obtain the necessary usage data for the weapon system. To accurately identify trends in the data it is necessary to monitor the failure recurrence data on a daily basis. LIMS-EV limits data exports of daily data to a maximum of one month. The data for failures and for operating hours are in different reports, so they must be exported separately. To get two years of data for four weapon system configurations and three subsystems on each configuration required 576 individual exports from LIMS-EV (24 months * 4 configurations * 3 subsystems * 2 reports, one for hours and one for failures). The LIMS-EV interface is very quirky: it often times out before a query can be set up, and similar queries are output in different formats, so query sizes must be small.
The  process  is  detailed  in  Attachment  A. 
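
The export count can be checked by enumerating the combinations. The short Python sketch below is only an illustration of the arithmetic; the report labels are hypothetical names, and nothing here interacts with LIMS-EV.

```python
from itertools import product

MONTHS = range(1, 25)                      # 24 one-month export windows
CONFIGURATIONS = ["A", "B", "C", "D"]      # weapon system configurations
SUBSYSTEMS = ["14A", "74A", "74C"]         # Work Unit Codes
REPORTS = ["operating_hours", "failures"]  # hypothetical report labels

# One export per (month, configuration, subsystem, report) combination.
exports = list(product(MONTHS, CONFIGURATIONS, SUBSYSTEMS, REPORTS))
print(len(exports))
```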
2. Plot the recurrence events on a cumulative timeline.


The  sequential  time-to-failure  data  set  for  three  subsystems  of  each  instance  of  the 
weapon  system  over  two  years  was  imported  into  JMP.  The  built-in  JMP  Recurrent  Event 
Analysis  was  used  to  plot  the  recurrent  events  on  a  cumulative  timeline,  Figure  8,  and  to 
calculate  the  MCF  and  confidence  bounds. 

Figure 8 shows cumulative event timelines, plotted in operating hours, for a sample of the population of end items of a USAF weapon system. The events are failures of a subsystem of common design across the weapon system. Each timeline represents the failure history of a serialized end item. The vertical lines on the timeline represent the failures.

Figure 8 Event Plots of Subsystem Failures for a Sample of the Weapon System End Item Population.

The differences in the failure distribution across the population and the clustering of failures on individual end items suggest that the failure distribution for the population is not iid. Without an interval estimate of the HPP intensity parameter [19] the HPP would not be an appropriate assumption for failures of this subsystem; that is, a constant ROCOF would not be an appropriate assumption. But as with the illustrative set of data shown in Figure 3, fitting to a parametric model is not necessary for cause analysis.

Important conclusions can be made from a quick review of the data presented in Figure 8. The subsystem operates without failure for long operational periods on many end items. Other end items have a relatively large number of failures, and many of the failures across the sample are clustered. The long periods of operation without failure would suggest that there is not a problem with inherent reliability of the subsystem hardware design or implementation. The clustered failure pattern on some end items would suggest poor fault isolation procedures or training, or that a significant number of the components used to repair the subsystem are dead on arrival. To identify the root causes of failures in this subsystem, a similar analysis of the recurrent failure event histories of the subsystem components would be valuable.
3. Plot the cumulative event functions for each system of the population or of a statistically valid random sample of systems.

Figure  9  and  Figure  10  are  plots  of  the  MCF  of  failures  of  two  different 
subsystems  respectively  with  a  design  common  across  the  USAF  weapon  system.  The 
weapon  system  is  divided  into  four  subset  configurations  due  to  the  large  size  of  the 
weapon  system  population.  The  MCF  of  the  subset  configurations  are  plotted  with  the 
MCF  of  the  weapon  system.  These  four  subset  configurations  are  operating  at  the  same 
time  at  different  locations  and  in  different  commands  across  the  USAF. 


Figure  9  Plot  of  MCF  of  Subsystem  2  for  the  Weapon  System  and  Subset 
Configurations 

Figure  9  and  Figure  10  demonstrate  the  capability  to  examine  and  compare  the 
relative  reliability  performance  of  individual  end  items  or  subsets  of  the  population. 
In Figure 9 it is seen that Configuration A has a significantly higher number of failures per hour and drives the weapon system MCF. This subsystem is the same across all four configurations, so it stands to reason that an external factor is driving the failures of subsystem 2 in configuration A. It is possible that a nonmaterial change to configuration A would make an improvement in the weapon system reliability. Figure 10 shows consistent reliability across all four weapon system configurations for subsystem 1.


Figure  10  Plot  of  MCF  of  Subsystem  1  for  the  Weapon  System  and  Subset 
Configurations 
4. Plot the Mean Cumulative Function (MCF) and confidence bounds for the data set.

Figure 11 demonstrates the MCF of one subsystem of a USAF weapon system. The confidence interval is calculated by JMP as an Upper Confidence Limit (UCL) and Lower Confidence Limit (LCL) by fitting a distribution to the cumulative failures of all end items in the population at each failure time. In this case the interval is 95%. The MCF normalizes for population size, so the confidence interval widens with the operational hours due to the smaller population that accrues that many hours. The same characteristic results in a more stepwise character of the MCF as the population size decreases and individual failures have a relatively larger impact on the mean.

This plot allows a prediction to be made about failures as a function of operating hours. From the data shown in Figure 11 the first failure of subsystem 2 of an individual end item occurred between 370 and 470 hours and the second occurred between 940 and 1050 hours 95% of the time.


Figure  11  MCF  for  Subsystem  2  with  95%  Confidence  Limits 
5. Plot the ROCOF of the systems.

There is no closed form solution to the derivative of the MCF, so the ROCOF is approximated by calculating the slope of the MCF at each point in time. The ROCOF data presented in these charts was calculated by using the MS Excel slope function across seven data points, three before and three after each calculation point. This results in more smoothing at the right end of Figure 12 as the sample size decreases and the interval between data points increases.
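
The seven point slope calculation can be sketched in Python as follows. The least-squares slope is what Excel's slope function returns; the handling of the first and last three points, where a full window is not available, is not described in the text, so shrinking the window at the ends is an assumption of this sketch. The function names and the sample data are illustrative only.

```python
def ls_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def rocof(times, mcf_values, half_window=3):
    """Approximate the ROCOF at each MCF point by a windowed slope."""
    rates = []
    for i in range(len(times)):
        # Up to three points before and three after; fewer at the ends.
        lo = max(0, i - half_window)
        hi = min(len(times), i + half_window + 1)
        rates.append(ls_slope(times[lo:hi], mcf_values[lo:hi]))
    return rates


# Hypothetical MCF that rises at 0.02 failures per hour up to 50 hours
# and at 0.04 failures per hour afterwards.
times = list(range(0, 101, 10))
mcf_values = [0.02 * t if t <= 50 else 1.0 + 0.04 * (t - 50) for t in times]
print([round(r, 3) for r in rocof(times, mcf_values)])
```

The window width controls the trade-off noted above: a wider window smooths more and responds more slowly to a change in slope.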

The  ROCOF  of  subsystem  1  of  a  USAF  weapon  system  is  plotted  against 
operating  hours,  Figure  12,  and  calendar  days,  Figure  13.  The  plot  against  operating 
hours  shows  impacts  to  the  ROCOF  that  are  related  to  systems  usage.  The  plot  against 
calendar  days  shows  impact  to  the  ROCOF  due  to  events  that  are  external  to  the  systems. 
[9] 
Figure 12 ROCOF Plotted Versus Operational Hours

Figure 13 Plot of ROCOF Versus Calendar Date

ROCOF Compared to USAF Calculation of MTBF as a Measure of Reliability

The  USAF  requires  annual  reporting  of  MTBF  for  weapon  systems.  The  purpose 
of  the  reporting  is  to  monitor  the  reliability  performance  of  the  weapon  system.  The 
MTBF  is  often  incorrectly  analyzed  as  a  failure  rate  because  the  units,  hours/failures, 
seem  plausible.  By  definition  the  failure  rate  units  are  the  inverse,  failures/hours.  The 
inverse  of  the  USAF  calculation  of  MTBF  will  be  used  to  make  the  units  consistent  for 
comparison  with  the  ROCOF. 

Two  years  of  the  MCF  and  ROCOF  for  a  subsystem  of  a  USAF  weapon  system 
and  the  inverse  of  the  USAF  calculation  of  quarterly  and  annual  MTBF  are  plotted  in 
Figure  14.  This  plot  demonstrates  the  improved  response  of  the  ROCOF  compared  to  the 
MTBF  derived  ‘failure  rate’. 

From the plot it can be seen that using 1/MTBF as the failure rate obscures much of the deviation in the ROCOF. Many weapon system subsystems are repaired by replacing very expensive repairable units. Each spike in the ROCOF represents tens of thousands of dollars worth of replacement parts. As seen in the event timeline in Figure 8, the failures are often clustered, creating spikes in the ROCOF. Root cause analysis can identify mitigations, save millions of dollars in replacement parts, and improve weapon system availability by reducing the amount of unscheduled maintenance. But the root cause analysis cannot be done if the problem is obscured by using MTBF as the measure of reliability.

It appears from the plot that the magnitude of 1/MTBF is larger than a trend line through the ROCOF would be. The magnitudes are not directly comparable because the MTBF calculation is scaled by the operating hours in the period while, in this plot, the ROCOF is a function of calendar time.


Figure  14  Comparison  of  ROCOF  vs.  Annual  and  Quarterly  MTBF  ‘Failure  Rate’ 
Data Censoring

Censored data is also sometimes referred to as truncated or suspended data. Two types of censoring can occur in the data set. Right censoring occurs when the time to failure of a specific unit is not known due to the data period ending before failure. Left censoring occurs when the system was operating for some time before data is collected. At the start of the data period the population has some unknown operating time and unknown failure history.

The  demonstration  presented  in  this  chapter  utilized  two  years  of  failure  data  for  a 
USAF  weapon  system  that  has  been  fielded  for  more  than  fifteen  years.  A  complete  set  of 
recurrent  event  data  for  the  weapon  system  is  not  available.  An  arbitrary  starting  point 
was  chosen  within  the  range  of  available  data.  This  created  a  false  point  in  time  where  all 
end  items  in  the  population  appear  to  have  been  fielded  at  once  with  zero  operational 
hours.  The  actual  time  to  first  failure  in  the  period  of  analysis  is  not  known  for  any  items 
in  the  population.  The  time  from  the  beginning  of  the  period  of  analysis  to  the  first  failure 
in  the  period  is  used  in  the  calculation  of  MCF  per  operating  hour  as  if  it  were  the  actual 
time  to  first  occurrence.  The  left  data  censoring  is  a  source  of  error  in  the  magnitude  of 
the  MCF  and  ROCOF.  The  MCF  and  ROCOF  would  appear  to  be  higher  than  they 
actually  are  for  low  operating  hours  due  to  the  time  to  first  failure  being  truncated  for  all 
items  in  the  population.  The  error  decreases  as  the  data  analysis  period  increases  beyond 
the  ‘typical’  failure  free  period  of  the  end  items.  Figure  15  illustrates  the  order  of 
magnitude  of  the  left  censoring  error  in  this  chapter.  The  MCF  calculated  from  one  year 
of  data  increases  at  a  much  faster  rate  than  the  MCF  calculated  from  two  years  of  data. 
The weapon system continues to be in use, so end items have continued to accrue operating time after the period of analysis. The time to next failure after the last failure in the data analysis period is unknown. This is right data censoring. Right censored data is not included in the calculation of MCF by JMP. Right censoring could be a significant source of error if a significant number of end items in the population have long failure free periods relative to the data analysis period. This would make the MCF appear higher than it really is due to the most reliable systems being omitted from the MCF calculation. Ideally the MCF and ROCOF would be updated regularly. This periodic extension of the period of analysis would keep decreasing the percentage of the error.

The  USAF  calculation  of  MTBF  is  more  susceptible  to  both  left  and  right  data 
censoring  than  the  calculation  of  MCF.  A  calculation  of  MTBF  for  the  first  year  of  the 
data  analysis  period  would  have  the  same  left  censoring  error  as  the  MCF  but  would  also 
have  a  similar  error  from  right  censoring.  The  percentage  of  error  becomes  larger  as  the 
period  of  calculation  of  MTBF  decreases  until  the  point  where  the  number  is  undefined 
(zero  failures  in  the  period). 
[Figure 15: chart titled "Left Censoring Effect"]

Figure 15 Illustration of MCF Error Due to Left Data Censoring

Investigative Questions Answered

“Based  on  USAF  repairable  system  recurrence  data,  how  can  reliability  best  be 
non-parametrically  measured?” 

Reliability  of  fielded  USAF  repairable  systems  should  be  measured  using  the 
basic  operations  of  recurrent  event  data  analysis  advocated  by  Nelson  [27],  and  the 
recurrent  events  data  analysis  process  adapted  from  Trindade  and  Nathan  [9]  as  presented 
in  Chapter  III  and  demonstrated  in  Chapter  IV.  This  measurement  process  could  be 
implemented  using  existing  USAF  maintenance  data  but  would  be  much  more  effective 
with  an  enterprise  reliability  data  architecture  to  support  it. 
Summary

It has been shown that the USAF calculation of MTBF is not an effective measure of the reliability of fielded repairable systems. The basic assumptions that are necessary for the USAF calculation of MTBF to be applicable (iid) are rarely reasonable for fielded repairable systems. USAF guidance and policy do not mention that there are assumptions that must be verified and stated, nor are alternative measures discussed.

When  MTBF  is  applied  inappropriately  it  is  a  lagging  metric  that  misses 
developing  trends  and  short  term  spikes  in  ROCOF.  If  the  period  of  the  MTBF 
calculation  is  shortened  to  reduce  latency  the  error  increases  due  to  left  and  right  data 
censoring.  If  the  period  is  extended  to  reduce  censoring  the  smoothing  and  latency 
increases. 

The  nonparametric  recurrent  event  data  analysis  process  of  this  chapter  shows  that 
the  reliability  of  fielded  repairable  systems  and  the  first  level  of  indentured  subsystems 
can  be  measured  in  near  real  time.  The  technique  does  not  require  complex  statistical 
models  that  require  parameterization.  Complex  numerical  techniques  are  not  required  to 
solve  specialized  mathematics.  No  new  data  elements  or  sources  of  data  are  required. 

However  there  are  complications  to  using  the  nonparametric  recurrent  event  data 
analysis  process  in  the  current  environment.  The  existing  USAF  data  is  not  readily 
available  in  the  quantity  and  format  necessary.  There  is  a  substantial  time  investment  to 
set  up  the  necessary  data  and  it  must  be  updated  regularly  to  take  advantage  of  the  near 
real  time  issue  identification  capability.  Today  that  investment  must  be  made  for  every 
system  to  be  considered.  As  much  of  the  lifecycle  failure  event  history  as  possible  should 
be  included  in  the  analysis  period  to  minimize  the  left  censoring  error. 
V. Conclusions and Recommendations


Chapter Overview

The DoD has increased the emphasis on reliability in order to reduce sustainment costs and improve weapon system availability. But the emphasis is on the development phase, where there is a mandatory and formalized reliability program. There are millions of dollars to be saved and significant improvement available in A0 by improving the reliability of fielded legacy weapon systems. There is little focus on reliability in the sustainment phase. The very definition of reliability in the USAF does not have relevance to fielded repairable systems. MTBF is defined as the metric to report the status of weapon system reliability, but it does not provide the intended information and is likely calculated incorrectly.

The  opportunity  exists  to  utilize  existing  data  to  measure  the  reliability  of  fielded 
repairable  systems.  Anecdotal  evidence  suggests  valuable  reliability  improvement  can  be 
made,  without  redesigning  systems,  by  accurate  and  timely  analysis  of  the  MCF  and 
monitoring  of  fielded  repairable  system  ROCOF. 
Conclusions of Research


Effective  measurement  and  improvement  of  weapon  system  reliability  in  the 
sustainment  phase  requires  accurate  and  timely  data  documenting  the  lifecycle  of 
individual  items  and  specific  material.  Effective  measurement  and  improvement  of 
weapon  system  reliability  in  the  sustainment  phase  requires  expert  application  of 
pertinent  analysis  to  that  data. 

There is no USAF policy or guidance for the analysis of reliability data after fielding. The USAF definition of reliability is not coherent with the desired operational outcome or rigorous statistical analysis. The policy and guidance do not make a distinction between the context areas as presented in Table 1. There may be areas of reliability expertise and effective reliability data analysis within the USAF sustainment community, but their existence is dependent on the priorities of the weapon system management.

MTBF  is  not  an  effective  metric  for  the  reliability  of  fielded  repairable  systems  if 
the  purpose  for  measurement  is  for  reliability  improvement.  The  USAF  calculation  of 
MTBF  for  a  windowed  period  of  the  lifecycle  is  not  the  mean  operational  time  between 
failures  in  the  period  due  to  left  and  right  data  censoring.  It  is  a  lagging  indicator  that 
tends  to  obscure  important  trends  and  indicators  in  the  data. 

A method for nonparametric analysis of recurrent events is well defined in the literature. It is applicable to the reliability of fielded repairable systems. The USAF has data available that could be used for recurrent event analysis at least to the first level of indenture below the end item. However, the lack of operational time tracking of serialized items below the end item level limits the ability to apply the process at lower levels of indenture.
Significance of Research

It  appears  that  the  USAF  reliability  program  is  not  a  priority.  There  is  no  person 
responsible  for  reliability  at  the  AF  level.  It  is  not  clear  who  is  responsible  for  reliability 
in  AFMC.  There  is  no  USAF  standard  architecture  for  reliability  data  collection  or 
analysis.  The  requirement  for  measuring  fielded  system  reliability  is  identified  as  the 
responsibility  of  the  PM  in  several  AF  documents. 

The USAF reliability metric is typically reported on an annual basis at such a high level that no one realizes the wrong metric is used and that it is calculated incorrectly. The error generally does not have an impact because the data is not actionable. If an effort is made to improve reliability, the metric cannot track results.

The correct data analysis process for fielded repairable USAF systems is available. The process has the capability to identify specific poor performing end items and subsystems for improvement. The process has the capability to compare the respective reliability of subsets of the weapon system population based on such parameters as location, using command, and configuration. This capability allows root cause identification and accurate measurement of the impacts of reliability improvement efforts or other modifications.

Recommendations  for  Action 

Conduct a study to determine if the effort and expense being invested in the acquisition reliability programs are producing the intended result in the fielded systems. The DoD has renewed emphasis on reliability in order to reduce sustainment costs, but the focus has been on reliability prediction and test during the development phase of programs, column 1 of Table 1. Studies have characterized the success or failure of reliability programs by the Operational Test and Evaluation (OT&E) results. No attempt has been made to examine if successful OT&E reliability results correlate with effective reliability of the fielded system.

Adopt the standard definition of reliability with the four important elements: 1. “The probability of” 2. “an item to perform a required function” 3. “under stated conditions” 4. “for a specified period of time.” [15] [16] Remove all reference to calculation of reliability metrics (MTBF) from the definition of reliability.

Suspend  the  use  of  MTBF  as  ‘the’  measure  of  reliability.  The  use  of  MTBF  as  the 
only  measure  trivializes  a  very  complex  topic.  The  correct  measure  depends  on  the 
context  of  the  requirement  and  the  data  source  as  described  in  Table  1.  The  USAF  has 
many  initiatives  to  improve  reliability  for  many  purposes.  There  must  be  experts 
available  to  recommend  the  appropriate  analysis  and  measure  depending  on  the  context. 

Reliability  of  fielded  USAF  repairable  systems  should  be  measured  using  the 
basic  operations  of  recurrent  event  data  analysis  advocated  by  Nelson  [27],  and  the 
recurrent  events  data  analysis  process  adapted  from  Trindade  and  Nathan  [9]  as  presented 
in  Chapter  III  and  demonstrated  in  Chapter  IV.  Make  the  recurrent  events  analysis 
process  available  with  access  to  the  relevant  data.  The  USAF  has  positioned  LIMS-EV  as 
the  single  source  of  truth  for  enterprise  data.  LIMS-EV  should  have  a  view  for
sustainment  data  that  provides  an  interface  to  a  recurrent  event  analysis  capability  down  to  the  lowest
repair  level  of  serialized  items.  The  research  for  this  thesis  required  much  manual  data
correlation  and  formatting.  All  of  that  could  be  done  within  GCSS-AF  and  made
available  across  the  enterprise. 
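
The  basic  nonparametric  operation  behind  this  recommendation  can  be  sketched  briefly.  The  following  Python  example  is  a  minimal  illustration  of  a  Nelson-style  Mean  Cumulative  Function  (MCF)  estimate,  not  the  implementation  used  in  Chapters  III  and  IV.  It  assumes  each  serialized  item  is  described  by  its  failure  ages  and  the  age  at  which  observation  ended  (for  example,  in  flying  hours);  the  function  name,  the  input  layout,  and  the  sample  values  are  hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout: serial number -> (failure ages, end-of-observation age).
History = Dict[str, Tuple[List[float], float]]


def mean_cumulative_function(histories: History) -> List[Tuple[float, float]]:
    """Return (age, MCF estimate) pairs, one per observed failure."""
    end_ages = []
    failure_ages = []
    for serial, (ages, end_age) in histories.items():
        if any(age > end_age for age in ages):
            raise ValueError(f"{serial}: failure recorded after observation ended")
        end_ages.append(end_age)
        failure_ages.extend(ages)

    curve = []
    mcf = 0.0
    for age in sorted(failure_ages):
        # Items still under observation at this age form the risk set.
        at_risk = sum(1 for end_age in end_ages if end_age >= age)
        # Each failure adds 1 / (number at risk) to the cumulative estimate.
        mcf += 1.0 / at_risk
        curve.append((age, mcf))
    return curve


# Made-up example: three items, ages in flying hours.
fleet = {
    "A": ([100.0, 300.0], 400.0),
    "B": ([150.0], 200.0),
    "C": ([], 500.0),
}
for age, value in mean_cumulative_function(fleet):
    print(f"{age:7.1f}  {value:.3f}")
```

In  this  sketch  the  estimate  rises  by  1/3  at  each  of  the  first  two  failures  and  by  1/2  at  the  third,  because  item  B  has  left  observation  by  300  hours.  The  slope  of  the  resulting  curve  over  an  age  interval  is  an  estimate  of  ROCOF  for  that  interval.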

Recommendations  for  Future  Research


1.  It  is  possible  to  mechanize  the  recurrent  event  analysis  process  and  output
notifications  of  reliability  issues  (MCF  trends  or  ROCOF  spikes).  The  USAF  office
AF/A4ID  sponsored  a  pathfinder  project  to  demonstrate  the  capability  to  automate
recurrent  event  analysis.  The  project  produced  an  IT  tool  that  analyzed  massive  amounts
of  maintenance  data,  comparing  the  current  state  with  historical  data  to  automate  near-real-time
identification  of  abnormal  events  [37].
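
One  simple  way  to  picture  such  a  notification  is  sketched  below.  This  is  not  the  method  of  the  pathfinder  tool  [37],  whose  internals  are  not  described  here;  it  is  an  assumed  rule  that  flags  a  period  when  its  observed  ROCOF  (failures  per  operating  hour)  exceeds  the  mean  of  the  preceding  periods  by  a  chosen  number  of  standard  deviations.  The  function  name,  window  length,  threshold,  and  sample  counts  are  all  hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def flag_rocof_spikes(
    failures: Sequence[int],
    hours: Sequence[float],
    window: int = 6,         # assumed number of prior periods used as history
    threshold: float = 3.0,  # assumed number of standard deviations
) -> Tuple[List[float], List[int]]:
    """Return per-period ROCOF estimates and the indices flagged as spikes."""
    if len(failures) != len(hours):
        raise ValueError("failures and hours must cover the same periods")
    if window < 2:
        raise ValueError("window must be at least 2")  # stdev needs two points
    if any(h <= 0 for h in hours):
        raise ValueError("every period needs positive operating hours")

    # Observed ROCOF per period: failures divided by fleet operating hours.
    rates = [f / h for f, h in zip(failures, hours)]
    flagged = []
    for i in range(window, len(rates)):
        history = rates[i - window:i]
        limit = mean(history) + threshold * stdev(history)
        if rates[i] > limit:
            flagged.append(i)
    return rates, flagged


# Made-up daily counts for one subsystem across the fleet.
daily_failures = [4, 5, 4, 6, 5, 4, 15]
daily_hours = [1000.0] * 7
rates, spikes = flag_rocof_spikes(daily_failures, daily_hours)
print(spikes)
```

With  these  made-up  counts  only  the  final  period  is  flagged.  A  fielded  version  would  need  a  rule  chosen  and  validated  against  real  maintenance  data;  the  fixed  window  and  threshold  here  are  only  placeholders.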

2.  Accurate  sequence  and  time-to-failure  data  is  not  generally  available  below  the
first  level  of  indenture  for  USAF  weapon  systems.  The  operating  time  data
attribute  appears  to  be  available  in  the  USAF  maintenance  data  collection  system  for  all  serialized
items,  but  it  does  not  appear  to  be  a  required  element:  no  value  is  recorded  unless  one  is
manually  entered  by  the  technician.  The  required  data  is  available  within  the  USAF
enterprise,  and  accurate  population  of  that  data  element  could  be  automated  as  it  is  for  end
items.

USAF  serialized  maintenance  data  is  currently  entered  manually  with  no  error 
checking.  Research  for  this  thesis  reviewed  serialized  history  for  components  and  found 
that  there  are  significant  numbers  of  incorrect  serial  numbers  in  the  data.  One  very 
critical  and  expensive  component  has  20%  more  serial  numbers  in  the  system  than  items 
in  the  inventory.  This  creates  a  maintenance  record  for  items  that  do  not  exist  and  omits 
valuable  data  from  the  record  of  existing  items. 
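
A  basic  error  check  at  data  entry  would  catch  many  of  these  cases.  The  sketch  below  assumes  that  an  authoritative  list  of  inventory  serial  numbers  is  available  and  compares  recorded  serial  numbers  against  it,  suggesting  a  likely  intended  serial  number  where  one  is  similar  enough.  The  function  name,  the  similarity  cutoff,  and  the  sample  serial  numbers  are  hypothetical;  this  is  not  an  existing  USAF  system  feature.

```python
from difflib import get_close_matches
from typing import Dict, Iterable, List


def check_serials(
    recorded: Iterable[str],
    inventory: Iterable[str],
    cutoff: float = 0.8,  # assumed similarity cutoff for offering a suggestion
) -> Dict[str, List[str]]:
    """Map each recorded serial absent from inventory to its closest valid serial."""
    valid = sorted(set(inventory))
    valid_set = set(valid)
    report = {}
    for serial in sorted(set(recorded) - valid_set):
        # An empty list means no inventory serial is similar enough.
        report[serial] = get_close_matches(serial, valid, n=1, cutoff=cutoff)
    return report


# Made-up serial numbers for illustration only.
inventory = ["AB1234", "CD0001", "EF7777"]
recorded = ["AB1234", "AB1243", "CD0001", "ZZ9999"]
for serial, suggestions in check_serials(recorded, inventory).items():
    print(serial, suggestions)
```

Here  the  transposed  entry  is  paired  with  a  plausible  inventory  serial  number,  while  the  entry  with  no  close  match  is  left  for  manual  review.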

Many  USAF  subsystems  are  made  up  of  complex,  irreplaceable,  repairable  units. 
Some  of  these  units  are  worth  millions  of  dollars  and  cost  hundreds  of  thousands  to 
repair.  To  accurately  measure  the  reliability  of  those  items  the  data  gap  must  be  bridged. 
The  existing  data  architecture  could  be  researched  and  compared  to  the  data  elements
necessary  for  effective  reliability  measurement  for  all.  The  user  interface  and  GCSS-AF 
interfaces  could  be  researched  to  identify  sources  and  impacts  of  errors. 

Summary 

The  DoD  and  USAF  definition  of  reliability  and  the  required  reporting  metric 
should  be  reconsidered.  The  definition  should  be  broad  enough  to  include  all  of  the 
context  areas  presented  in  Table  1  and  the  required  metrics  should  be  tailored  to  the 
specific  context.  MTBF  is  not  an  effective  metric  for  measuring  the  reliability  of  repairable  USAF
systems.

The  DoD  intends  to  improve  reliability  in  sustainment,  but  all  of  the  effort  is
aimed  at  acquisition  programs.  The  USAF  can  work  toward  the  DoD  goal  of  reducing
sustainment  costs  and  improving  weapon  system  A0  by  improving  reliability  without 
limiting  the  improvement  to  current  and  future  acquisition  programs.  The  reliability  of 
fielded  repairable  systems  may  be  improved  with  effective  measurement  and  analysis. 

Nonparametric  recurrent  event  data  analysis  is  the  correct  process  to  use  for 
assessment  of  the  reliability  of  fielded  repairable  systems.  The  necessary  USAF  data  to 
implement  the  process  at  the  subsystem  level  is  currently  collected.  Many  weapon 
systems  have  historical  records  of  the  data.  However,  the  data  is  not  readily  accessible  to
reliability  analysts,  and  the  required  knowledge,  software  tools,  and  resources  are  not
generally  available  across  the  USAF  enterprise.  As  this  thesis  has  shown,  limited
analysis  can  be  done  if  weapon  system  managers  are  willing  to  invest  the  resources.

Appendix  A


Data  Extraction  from  LIMS-EV  [Code  3  Breaks] 

LIMS-EV  Weapon  System  View 
WUC  Tab 

1.  At  Time  Increment:  select  ‘By  Month’  (Select  month  with  slider). 

2.  At  Time  Increment:  select  ‘By  Day’  (Adjust  to  desired  period  with 
calendar  pop-up.) 

3.  Select  desired  population  as  appropriate. 

4.  At  Hours/Numeric/Mean  Time  buttons  select  ‘Numeric’. 

5.  At  Metric:  select  ‘Code  3  Breaks’. 

6.  At  WUC  Digit:  select  ‘3  Digit’. 

7.  At  Driving  WUC/LCN/Ref  Des:  select  appropriate  WUC. 

8.  Click  the  Update  button. 

9.  At  the  data  table  View  By:  select  ‘3  Digit  WUC’. 

10.  Check  the  Transpose  Grid  box. 

11.  At  the  data  table  Export  Grid  dropdown  select  ‘View  -  metrics
selected  in  the  grid’. 

12.  Save  the  export  file. 

13.  Repeat  from  1  for  each  month  until  the  desired  period  is  exported. 

Data  Extraction  from  LIMS-EV  [Flying  Hours] 

LIMS-EV  Weapon  System  View 
Utilization  Tab 

1.  At  Time  Increment:  select  ‘By  Month’  (Select  month  with  slider). 

2.  At  Time  Increment:  select  ‘By  Day’  (Adjust  to  desired  period  with 
calendar  pop-up.) 

3.  Select  desired  population  as  appropriate. 

4.  At  Rate/Hours/Numeric  buttons  select  ‘Hours’. 

5.  Click  the  Update  button.

6.  At  the  data  table  View  By:  select  ‘Serial  Number’. 

7.  Check  the  Transpose  Grid  box. 

8.  At  the  data  table  Export  Grid  dropdown  select  ‘View  -  metrics 
selected  in  the  grid’. 

9.  Save  the  export  file. 

10.  Repeat  from  1  for  each  month  until  the  desired  period  is  exported. 